A follow-up question:
Instead of using pickle or joblib.dump, is it possible to export the
model.coef_ values and use them directly to predict on new, unlabeled files?
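In principle yes, provided the intercept is exported as well: for binary logistic regression the positive-class probability is the sigmoid of `X @ coef_.T + intercept_`. Below is a minimal sketch (not scikit-learn API, just plain NumPy); the function names and the example coefficient values are hypothetical.

```python
import numpy as np

def predict_proba(X, coef, intercept):
    """Sigmoid of the linear score; for binary problems this reproduces
    LogisticRegression's positive-class probability."""
    scores = X @ coef.T + intercept
    return 1.0 / (1.0 + np.exp(-scores))

def predict(X, coef, intercept, threshold=0.5):
    """Threshold the probability to get hard 0/1 labels."""
    return (predict_proba(X, coef, intercept) >= threshold).astype(int)

# Hypothetical exported parameters (shapes match model.coef_ / model.intercept_)
coef = np.array([[2.0]])
intercept = np.array([-3.0])
X_new = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = predict(X_new, coef, intercept)
```

This only covers prediction; anything else stored on the estimator (e.g. `classes_`) would also need to be exported by hand, which is why pickling is usually simpler.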




------------------ Original Message ------------------
From: "507562032" <507562...@qq.com>
Sent: Thursday, June 18, 2015, 8:12 PM
To: "scikit-learn-general" <scikit-learn-general@lists.sourceforge.net>

Subject: using joblib.dump function in hadoop stream mode



Hi Experts,


Very glad to learn of this mailing list, where scikit-learn users discuss
interesting topics.
I'm currently working on pipelining a logistic regression model in Hadoop
streaming mode:


$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar -D mapred.map.tasks=1 \
  -file ${program_path}/sckl_LR_train_mapper.py -mapper "python sckl_LR_train_mapper.py xx 10" \
  -file ${program_path}/sckl_LR_train_reducer.py -reducer "python sckl_LR_train_reducer.py xx 10" \
  -input /user/hive/warehouse/Classification/Logistic_Regression/input/20150615/horseColicTraining.txt \
  -output /output/LR/06181913

(Note: the second -file flag must ship the reducer script, and a space is
needed before -input.)


I ran into a problem when trying to dump the trained model with joblib.dump()
in sckl_LR_train_mapper.py:

    from sklearn import linear_model
    from sklearn.externals import joblib  # joblib shipped under sklearn.externals in 2015-era releases

    # Fit a logistic regression model and persist it to a local path
    logisticRegression = linear_model.LogisticRegression()
    model = logisticRegression.fit(train_features, train_targets)
    joblib.dump(model, "/home/models/model.pkl", compress=9)


However, the job failed with the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
      at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
      at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
      at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
      at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)



At first I thought that in Hadoop streaming mode the task process cannot write
to the local path /home/models/model.pkl and can only address HDFS locations,
so I tried
       joblib.dump(model, sys.stdout, compress=9)
but the same error was reported.
I'm wondering how to sequence model training, model saving, and prediction
with the trained model in this setting.
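(One workaround, sketched below under the assumption that the goal is simply to get the fitted model out of the mapper without touching a local filesystem path: serialize the model to an in-memory buffer and emit it as a single base64-encoded line on stdout, so Hadoop streaming collects it into the job output on HDFS. The function names and the "model" tab key are hypothetical, not scikit-learn or Hadoop API.)

```python
import base64
import io
import pickle

def emit_model(model, stream):
    """Pickle the fitted model into an in-memory buffer and write it as one
    base64 text line, safe for Hadoop streaming's line-oriented stdout."""
    buf = io.BytesIO()
    pickle.dump(model, buf)
    payload = base64.b64encode(buf.getvalue()).decode("ascii")
    stream.write("model\t" + payload + "\n")

def load_model(line):
    """Reverse step for the consumer: decode the base64 payload back into
    the original Python object."""
    _, payload = line.rstrip("\n").split("\t", 1)
    return pickle.loads(base64.b64decode(payload))
```

In the mapper one would call `emit_model(model, sys.stdout)` instead of joblib.dump with a local path; the prediction job then reads the output file back and calls `load_model` on that line.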
Could anyone help out please?


Jackie
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
