[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458 ]
Joseph K. Bradley commented on SPARK-4328:
------------------------------------------
[~atalwalkar] Thanks for pointing this out!
> Python serialization updates make Python ML API more brittle to types
> ---------------------------------------------------------------------
>
> Key: SPARK-4328
> URL: https://issues.apache.org/jira/browse/SPARK-4328
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.2.0
> Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as
> integers, and then use it with LinearRegression. This was broken by the
> Python API updates since then. E.g., this code runs in the 1.1 branch but
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
> at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
> at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
> at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
> at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
> at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
> at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
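The ClassCastException above comes from the JVM-side unpickler receiving a java.lang.Integer where it expects a java.lang.Double, because the Python label was an int. The recommendation to "allow users to use integers from Python" amounts to coercing numeric labels (and feature values) to float on the Python side before serialization. The sketch below is hypothetical: it uses a stand-in class, not the actual pyspark.mllib.regression.LabeledPoint, to show the coercion idea in isolation.

```python
class LabeledPointSketch:
    """Hypothetical stand-in illustrating label/feature coercion.

    Coercing with float() means Python ints (and numpy integer scalars)
    are serialized as doubles, so the JVM unpickler never sees a
    java.lang.Integer where it expects a java.lang.Double.
    """

    def __init__(self, label, features):
        # float() accepts int, float, and numpy numeric scalars alike.
        self.label = float(label)
        # Coerce each feature value the same way.
        self.features = [float(x) for x in features]
```

With this coercion in place, `LabeledPointSketch(1, [1, 2, 3])` carries a label of 1.0 rather than the integer 1, which is the behavior the reporter's 1.1-era code relied on.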
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)