Joseph K. Bradley created SPARK-4328:
----------------------------------------

             Summary: Python serialization updates make Python ML API more brittle to types
                 Key: SPARK-4328
                 URL: https://issues.apache.org/jira/browse/SPARK-4328
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
    Affects Versions: 1.2.0
            Reporter: Joseph K. Bradley


In Spark 1.1, you could create a LabeledPoint with labels specified as integers and then use it with LinearRegression.  This was broken by the subsequent Python API serialization updates.  E.g., this code runs in the 1.1 branch but not in the current master:

{code}
from pyspark.mllib.regression import *
import numpy

# `sc` is the SparkContext provided by the pyspark shell.
features = numpy.ndarray((3))  # uninitialized 3-element feature vector
data = sc.parallelize([LabeledPoint(1, features)])  # label is a Python int
LinearRegressionWithSGD.train(data)  # fails on master (see error below)
{code}
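
Until this is fixed, a workaround is to cast the label to float on the Python side.  A minimal sketch of the same repro with the cast added:

{code}
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy

features = numpy.ndarray((3))
# Casting the label to float avoids the Integer -> Double cast failure.
data = sc.parallelize([LabeledPoint(float(1), features)])
LinearRegressionWithSGD.train(data)
{code}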

Recommendation: Allow users to pass integers from Python, coercing them to floats where doubles are expected.
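
One possible way to do this (a hypothetical sketch, not an actual patch) is to coerce the label to float in LabeledPoint's constructor on the Python side, so that only doubles ever reach the pickler:

{code}
# Hypothetical sketch: coerce the label in LabeledPoint.__init__ so that only
# Python floats (pickled as Java doubles) are ever sent to the JVM.
class LabeledPoint(object):
    def __init__(self, label, features):
        self.label = float(label)   # accepts int, long, numpy scalars, ...
        self.features = features    # the real class also normalizes this to a Vector
{code}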

The error you get is a ClassCastException thrown while unpickling the integer label as a Double:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
        at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
        at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
        at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
        at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
        at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
        at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
