[ https://issues.apache.org/jira/browse/SPARK-22482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247516#comment-16247516 ]
Sean Owen commented on SPARK-22482:
-----------------------------------
The error comes from Parquet, and you are already pursuing it there. It could
well be a version-mismatch issue. I don't see evidence of a Spark problem here.
> Unreadable Parquet array columns
> --------------------------------
>
> Key: SPARK-22482
> URL: https://issues.apache.org/jira/browse/SPARK-22482
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0
> Parquet 1.8.1
> Hive 1.2
> Hive 2.1.0
> Presto 0.157
> Presto 0.180
> Reporter: Costas Piliotis
>
> We have seen an issue with Parquet data written out by Spark: int and
> boolean array columns throw exceptions when the Parquet files are read from
> Hive and Presto.
> I've logged PARQUET-1157 with the Parquet project, but I'm not sure whether
> it's an issue within their project or within Spark itself.
> Spark reads parquet-avro data that is output by a MapReduce job and writes
> it back out as Parquet.
> The inbound parquet format has the column defined as:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated int32 array;
> }
> {code}
> Spark is redefining this data as this:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
> {code}
> and with legacy format:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group bag {
>     optional int32 array;
>   }
> }
> {code}
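> For context: whether Spark emits the three-level "list/element" layout or
> the legacy "bag/array" layout is controlled by the
> spark.sql.parquet.writeLegacyFormat option. A minimal sketch of toggling it
> (the output path and sample data below are hypothetical, not from this job):
> {code}
> import org.apache.spark.sql.SparkSession
>
> object ListLayoutSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().appName("list-layout").getOrCreate()
>     import spark.implicits._
>
>     // false (default): repeated group list { optional int32 element; }
>     // true (legacy):   repeated group bag  { optional int32 array; }
>     spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
>
>     // Hypothetical sample data standing in for the real column.
>     Seq(Seq(1, 2, 3), Seq(4, 5))
>       .toDF("playerpositions_ai")
>       .write.mode("overwrite")
>       .parquet("/tmp/list-layout-sketch")
>   }
> }
> {code}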
> The Parquet data was tested in Hive 1.2, Hive 2.1, Presto 0.157, Presto
> 0.180, and Spark 2.1, as well as Amazon Athena (which is a Presto-based
> implementation).
> I believe that the schemas above are valid for writing out Parquet.
> The spark command writing it out is simple:
> {code}
> data.repartition(((data.count() / 10000000) + 1).toInt)
>   .write.format("parquet")
>   .mode("append")
>   .partitionBy(partitionColumns: _*)
>   .save(path)
> {code}
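> To confirm which layout actually landed in the files, the footer schema can
> be read back with parquet-hadoop's footer API. A sketch using the Parquet
> 1.8.x-era readFooter call (the file path below is hypothetical):
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.format.converter.ParquetMetadataConverter
> import org.apache.parquet.hadoop.ParquetFileReader
>
> object FooterSchemaSketch {
>   def main(args: Array[String]): Unit = {
>     val footer = ParquetFileReader.readFooter(
>       new Configuration(),
>       new Path("/tmp/list-layout-sketch/part-00000.parquet"),  // hypothetical path
>       ParquetMetadataConverter.NO_FILTER)
>     // Prints the file's message type, including the LIST group layout.
>     println(footer.getFileMetaData.getSchema)
>   }
> }
> {code}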
> We initially wrote this out with the legacy format turned off, later turned
> it on, and have seen the same error both ways.
> Spark's stack trace from reading this is:
> {code}
> java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
>   at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
>   at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
>   at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
>   at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
>   at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
>   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Also note that our data is stored on S3, in case that matters.