[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-24828:
----------------------------------
    Attachment: a2_m2.parquet.zip

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24828
>                 URL: https://issues.apache.org/jira/browse/SPARK-24828
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>            Reporter: Romeo Kienzer
>            Priority: Minor
>         Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue; the related issue can be 
> found here.
>  
> The attached parquet file was written by one Spark installation. Reading it 
> with an installation obtained directly from 
> [http://spark.apache.org/downloads.html] yields the following exceptions:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
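> # `spark` is assumed to be an active SparkSession (e.g. the pyspark shell)
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> # parentheses keep the chained setters on one logical line
> binEval = (MulticlassClassificationEvaluator().setMetricName("accuracy")
>            .setPredictionCol("prediction").setLabelCol("label"))
> accuracy = binEval.evaluate(df)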
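
One possible reading of the second trace above, offered as a sketch and not as a confirmed diagnosis: the exception message names the Parquet dictionary class that holds the column's values (PlainLongDictionary, i.e. 64-bit values), while the failing frame (Dictionary.decodeToInt) shows which type the reader asked for. The small helper below pulls both pieces out of a trace. The function name `diagnose` and the two lookup tables are hypothetical, written only for this illustration; they are not part of Spark or Parquet.

```python
import re

# Assumption: each Parquet dictionary class corresponds to one physical type.
DICTIONARY_PHYSICAL_TYPE = {
    "PlainIntegerDictionary": "INT32",
    "PlainLongDictionary": "INT64",
    "PlainFloatDictionary": "FLOAT",
    "PlainDoubleDictionary": "DOUBLE",
    "PlainBinaryDictionary": "BINARY",
}

# The decode method in the failing frame tells us what the reader requested.
DECODE_REQUESTED_TYPE = {
    "decodeToInt": "int",
    "decodeToLong": "long",
    "decodeToFloat": "float",
    "decodeToDouble": "double",
    "decodeToBinary": "binary",
    "decodeToBoolean": "boolean",
}

DICT_RE = re.compile(r"PlainValuesDictionary\$(\w+)")
DECODE_RE = re.compile(r"Dictionary\.(decodeTo\w+)\(")


def diagnose(trace):
    """Return (stored_type, requested_type), or None if the trace does not match."""
    dict_match = DICT_RE.search(trace)
    decode_match = DECODE_RE.search(trace)
    if not (dict_match and decode_match):
        return None
    stored = DICTIONARY_PHYSICAL_TYPE.get(dict_match.group(1), "unknown")
    requested = DECODE_REQUESTED_TYPE.get(decode_match.group(1), "unknown")
    return stored, requested


# The first two lines of the second trace from this report.
TRACE = (
    "java.lang.UnsupportedOperationException: "
    "org.apache.parquet.column.values.dictionary."
    "PlainValuesDictionary$PlainLongDictionary\n"
    "    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)"
)

print(diagnose(TRACE))
```

If this reading is right, the file stores the column as INT64 while the reading side expects an int, so comparing df.printSchema() on both installations would be a reasonable next check.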



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
