[ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173383#comment-15173383
 ] 

Jeff Zhang commented on SPARK-13581:
------------------------------------

I suspect this is an issue in code generation. The root cause is that the query 
should read the features column but actually reads the label column, which 
causes the MatchError. Note that df.show() succeeds when there is no selection. 
The stack trace shows the error coming from the code generator. Could someone 
familiar with code generation help with this?

{code}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 5.0 (TID 5, localhost): scala.MatchError: 0.0 (of class 
java.lang.Double)
        at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
        at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:63)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:60)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:40)
        at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$5$$anon$1.hasNext(WholeStageCodegen.scala:305)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
{code}
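
To illustrate the hypothesis above, here is a minimal sketch that does not use Spark at all. The types Dense and Sparse, the serialize function, and the ordinals are hypothetical stand-ins, not the real MLlib or Catalyst code. It only shows how a serializer that pattern matches on vector types fails with a scala.MatchError when it is handed the Double from the label position instead of the vector from the features position.

```scala
// Hypothetical stand-ins for vector types; NOT the real MLlib classes.
sealed trait Vec
final case class Dense(values: Array[Double]) extends Vec
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object MatchErrorSketch {
  // Assumed shape of a UDT-style serializer: it accepts Any and only
  // handles the vector cases, with no default branch.
  def serialize(obj: Any): Seq[Any] = obj match {
    case Sparse(size, indices, values) => Seq(0, size, indices.toSeq, values.toSeq)
    case Dense(values)                 => Seq(1, values.toSeq)
    // Any other value (for example a Double) raises scala.MatchError here.
  }

  def main(args: Array[String]): Unit = {
    // A row laid out like the libsvm schema: (label, features).
    val row: Array[Any] = Array(0.0, Dense(Array(1.0, 2.0)))
    val labelOrdinal = 0     // assumed position of the label column
    val featuresOrdinal = 1  // assumed position of the features column

    // Reading the intended column serializes without error.
    println(serialize(row(featuresOrdinal)))

    // Reading the wrong ordinal passes a Double to the serializer,
    // which fails with a MatchError; Try captures it for printing.
    println(scala.util.Try(serialize(row(labelOrdinal))))
  }
}
```

If the real failure follows this pattern, the value in the error message (0.0 or 1.0) would be a label value, which is consistent with the traces in this issue but does not by itself prove where the wrong ordinal is introduced.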

> LibSVM throws MatchError
> ------------------------
>
>                 Key: SPARK-13581
>                 URL: https://issues.apache.org/jira/browse/SPARK-13581
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jakob Odersky
>            Assignee: Jeff Zhang
>            Priority: Minor
>
> Running an action on a DataFrame read from a libsvm file throws a 
> MatchError, whereas the same action on a cached DataFrame works fine.
> {code}
> // file is in the Spark repository
> val df = sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]        at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]        at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]        at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight into what could be going on?
> cc [~iyounus]



