[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-4328.
----------------------------------
    Resolution: Duplicate

This is covered in the PR for SPARK-4324.

> Python serialization updates make Python ML API more brittle to types
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4328
>                 URL: https://issues.apache.org/jira/browse/SPARK-4328
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as
> integers, and then use it with LinearRegression. This was broken by the
> Python API updates since then. E.g., this code runs in the 1.1 branch but
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
>         at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
>         at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
>         at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
>         at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
>         at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
>         at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>         at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
>         at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>         at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>         at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
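
Until the fix from SPARK-4324 lands, a user-side workaround is to cast the label to a Python float explicitly, so the pickled value unboxes to a java.lang.Double instead of an Integer on the JVM side. A minimal sketch, assuming a live SparkContext named sc as in the PySpark shell (numpy.zeros is used here just to get an initialized feature array):

{code}
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy

features = numpy.zeros(3)
# float(1) instead of 1: the label now pickles as a Double,
# avoiding the ClassCastException in the Scala unpickler
data = sc.parallelize([LabeledPoint(float(1), features)])
model = LinearRegressionWithSGD.train(data)
{code}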
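The recommendation could also be implemented once on the Python side by coercing the label in the LabeledPoint constructor, so integer labels always serialize as doubles. A sketch of that idea only, not the actual patch merged under SPARK-4324:

{code}
class LabeledPoint(object):
    def __init__(self, label, features):
        # Coerce to float so ints, longs, and numpy scalars all
        # pickle as a Double on the JVM side.
        self.label = float(label)
        # The real class also normalizes features to an MLlib Vector;
        # that conversion is elided in this sketch.
        self.features = features
{code}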