A follow-up question:
Instead of using pickle or joblib.dump approaches, is it possible to export
model.coef_ values and use these values to predict new unlabeled files?
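In principle, yes: for a fitted binary LogisticRegression, predict() is just a
thresholded sigmoid over coef_ and intercept_, so exporting those two arrays
(e.g. as plain text) is enough to score new data without unpickling anything.
A minimal sketch, with made-up coefficient values standing in for the exported
ones:

```python
import numpy as np

def predict_from_coefs(X, coef, intercept):
    """Reproduce binary LogisticRegression.predict from exported
    coef_ and intercept_ values."""
    scores = X @ coef.ravel() + intercept   # decision function
    probs = 1.0 / (1.0 + np.exp(-scores))   # sigmoid
    return (probs >= 0.5).astype(int)

# Hypothetical exported values (e.g. read back from a text file):
coef = np.array([[0.5, -1.2]])
intercept = np.array([0.25])
X_new = np.array([[1.0, 0.0], [0.0, 2.0]])
print(predict_from_coefs(X_new, coef, intercept))  # -> [1 0]
```

For multiclass models coef_ has one row per class and the prediction is the
argmax of the per-class scores, so the sketch would need to be extended
accordingly.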
------------------ Original Message ------------------
From: "507562032" <507562...@qq.com>
Date: Thursday, June 18, 2015, 8:12 PM
To: "scikit-learn-general" <scikit-learn-general@lists.sourceforge.net>
Subject: using joblib.dump function in hadoop stream mode
Hi Experts,
I'm glad to have found this mailing list, where scikit-learn users discuss
interesting topics.
I'm currently pipelining a logistic regression model in Hadoop streaming mode:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar \
  -D mapred.map.tasks=1 \
  -file ${program_path}/sckl_LR_train_mapper.py \
  -mapper "python sckl_LR_train_mapper.py xx 10" \
  -file ${program_path}/sckl_LR_train_reducer.py \
  -reducer "python sckl_LR_train_reducer.py xx 10" \
  -input /user/hive/warehouse/Classification/Logistic_Regression/input/20150615/horseColicTraining.txt \
  -output /output/LR/06181913
I ran into a problem when trying to dump a trained model with joblib.dump() in
sckl_LR_train_mapper.py:

from sklearn import linear_model
from sklearn.externals import joblib

logisticRegression = linear_model.LogisticRegression()
model = logisticRegression.fit(train_features, train_targets)
joblib.dump(model, "/home/models/model.pkl", compress=9)
However, errors were encountered as follows:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
At first I thought that in Hadoop streaming mode the system couldn't resolve the
local path /home/models/model.pkl and only recognized HDFS locations, so I tried

joblib.dump(model, sys.stdout, compress=9)

but the same error was reported.
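One workaround worth sketching here, under the assumption that joblib.dump
wants a real file path while a streaming mapper must write text lines to
stdout: pickle the model to bytes and emit it as a single base64-encoded line
that a reducer (or a later job) can decode. The key/tab format below is an
illustrative choice, not anything Hadoop mandates beyond line-oriented output:

```python
import base64
import pickle
import sys

def emit_model(model, key="model"):
    """Mapper side: serialize a fitted model to bytes and write it as
    one base64 text line on stdout (the line-oriented format Hadoop
    streaming expects)."""
    blob = base64.b64encode(pickle.dumps(model)).decode("ascii")
    sys.stdout.write(key + "\t" + blob + "\n")

def load_model(line):
    """Reducer side: decode a mapper line back into the original
    model object."""
    _, blob = line.rstrip("\n").split("\t", 1)
    return pickle.loads(base64.b64decode(blob))
```

The same round trip would work for any picklable estimator; the usual caveats
about unpickling data from untrusted sources apply.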
I'm wondering how to chain model training, model saving, and prediction with the
trained model in this setup.
Could anyone help out please?
Jackie
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general