[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458 ]
Joseph K. Bradley commented on SPARK-4328:
------------------------------------------
[~atalwalkar] Thanks for pointing this out!
> Python serialization updates make Python ML API more brittle to types
> ---------------------------------------------------------------------
>
> Key: SPARK-4328
> URL: https://issues.apache.org/jira/browse/SPARK-4328
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.2.0
> Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as
> integers, and then use it with LinearRegression. This was broken by the
> Python API updates since then. E.g., this code runs in the 1.1 branch but
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
> at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
> at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
> at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
> at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
> at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
> at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
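The ClassCastException above comes from the JVM-side unpickler receiving a java.lang.Integer where it expects a java.lang.Double, because the Python label was an int. The recommendation to "allow users to use integers from Python" amounts to coercing numeric labels (and feature values) to float on the Python side before serialization. The sketch below is hypothetical: it uses a stand-in class, not the actual pyspark.mllib.regression.LabeledPoint, to show the coercion idea in isolation.

```python
class LabeledPointSketch:
    """Hypothetical stand-in illustrating label/feature coercion.

    Coercing with float() means Python ints (and numpy integer scalars)
    are serialized as doubles, so the JVM unpickler never sees a
    java.lang.Integer where it expects a java.lang.Double.
    """

    def __init__(self, label, features):
        # float() accepts int, float, and numpy numeric scalars alike.
        self.label = float(label)
        # Coerce each feature value the same way.
        self.features = [float(x) for x in features]
```

With this coercion in place, `LabeledPointSketch(1, [1, 2, 3])` carries a label of 1.0 rather than the integer 1, which is the behavior the reporter's 1.1-era code relied on.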
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)