Joseph K. Bradley created SPARK-4328:
----------------------------------------
Summary: Python serialization updates make Python ML API more brittle to types
Key: SPARK-4328
URL: https://issues.apache.org/jira/browse/SPARK-4328
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
In Spark 1.1, you could create a LabeledPoint with its label specified as an
integer and then use it with LinearRegression. This was broken by the Python
API serialization updates since then. E.g., this code runs on the 1.1 branch
but fails on the current master:
{code}
from pyspark.mllib.regression import *
import numpy

# Assumes a SparkContext `sc`, e.g. in the pyspark shell.
features = numpy.ndarray((3))
# The label is the Python int 1, not the float 1.0.
data = sc.parallelize([LabeledPoint(1, features)])
LinearRegressionWithSGD.train(data)  # fails on master with a ClassCastException
{code}
Recommendation: Allow users to pass integer labels from Python.
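One way to implement that recommendation (a sketch only; the class below is a
stand-in, not the actual pyspark.mllib.regression.LabeledPoint) is to coerce
the label to float in the Python-side constructor, so the value pickled for the
JVM is always a double:

```python
# Hypothetical sketch, not the real pyspark.mllib.regression.LabeledPoint:
# coercing the label in __init__ lets users pass ints while the JVM side
# always unpickles a double.
class LabeledPoint(object):
    def __init__(self, label, features):
        self.label = float(label)  # accepts int or float labels
        self.features = features

p = LabeledPoint(1, [0.0, 0.0, 0.0])
assert isinstance(p.label, float)
```

This keeps the 1.1 behavior for users while leaving the JVM-side unpickler
unchanged.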
The error message you get is:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
    at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
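My reading of the trace (not yet verified against the SerDe code): Python's
pickle protocol preserves the int/float distinction, so an integer label
arrives on the JVM side as a java.lang.Integer, and the unpickler's
unboxToDouble call then throws the ClassCastException above. A minimal
Python-side illustration of the distinction surviving serialization:

```python
import pickle

# An int and a float pickle to different opcodes, so the type
# distinction survives serialization and reaches the JVM unpickler.
int_bytes = pickle.dumps(1)
float_bytes = pickle.dumps(1.0)
assert int_bytes != float_bytes
```

So the fix needs to normalize the label to a double on one side or the other,
rather than relying on the unpickler to coerce it.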
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)