[
https://issues.apache.org/jira/browse/PARQUET-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942621#comment-16942621
]
Oleksii Duzhyi commented on PARQUET-1157:
-----------------------------------------
We are facing the same issue. Do you have a way to reliably reproduce it?
> Parquet Write bug - parquet data unreadable by hive or presto or spark 2.1
> --------------------------------------------------------------------------
>
> Key: PARQUET-1157
> URL: https://issues.apache.org/jira/browse/PARQUET-1157
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.8.1
> Environment: parquet-avro
> spark 2.1
> hive 1.2
> hive 2.1.0
> presto 0.157
> presto 0.180
> Reporter: Costas Piliotis
> Priority: Major
> Attachments: log_106898428_1510201521.txt20171109-25172-1jt8dp2
>
>
> In our paradigm, a MapReduce job writes Parquet data to S3, and then we use a Spark
> job to consolidate those files from our staging area into target tables, adding
> partitions and modifying tables as needed.
> We have implemented and are using Parquet schema merging in Hive.
> For this column, the data written by our MapReduce task shows the following
> metadata (written as parquet-avro):
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated int32 array;
> }
> {code}
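> (For context on that layout: the two-level "repeated int32 array" form is what
> parquet-avro emits for Avro arrays by default. As a hedged sketch, assuming our
> MapReduce writer goes through parquet-avro 1.8.x, where the
> parquet.avro.write-old-list-structure property is available, the job could be
> switched to the standard three-level layout like this:)
> {code}
> // Hypothetical MapReduce job configuration; the property name is parquet-avro's
> // AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE constant.
> import org.apache.hadoop.conf.Configuration
> val conf = new Configuration()
> // false => write the standard three-level LIST layout instead of "repeated int32 array"
> conf.setBoolean("parquet.avro.write-old-list-structure", false)
> {code}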
> However, when Spark writes it out, the layout is converted. We have tried with the
> legacy Parquet format both on and off.
> Without the legacy format:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
> {code}
> and with the legacy format on:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group bag {
>     optional int32 array;
>   }
> }
> {code}
> From what I've been reading in the spec the latter seems valid.
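> For reference, the toggle we flipped between those two runs is Spark's
> writeLegacyFormat option; a minimal sketch (assuming Spark 2.1 and a SparkSession
> named spark) looks like:
> {code}
> // false (the default) produces the standard list/element layout shown first;
> // true produces the Hive/Spark-1.x style bag/array layout shown second.
> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
> {code}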
> Sporadically, we see some array columns in this Parquet format producing odd
> failures on read:
> {code}
> Query 20171108_224243_00083_ec9ww failed:
> com.facebook.presto.spi.PrestoException
> Can not read value at 28857 in block 0 in file s3://.....
> com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.advanceNextPosition(ParquetHiveRecordCursor.java:232)
> com.facebook.presto.hive.HiveCoercionRecordCursor.advanceNextPosition(HiveCoercionRecordCursor.java:98)
> com.facebook.presto.hive.HiveRecordCursor.advanceNextPosition(HiveRecordCursor.java:179)
> com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:99)
> com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:247)
> com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
> com.facebook.presto.operator.Driver.processFor(Driver.java:301)
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
> com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:534)
> com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:670)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> And when reading this file in Spark:
> {code}
> java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
> at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
> at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
> at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
> at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
> at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
> at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
> at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
> at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm perhaps hopeful that this bug has already been fixed and is related to PARQUET-511.
> For giggles I also took this Parquet data and loaded it into Amazon Athena
> (which is basically Presto anyway) in hopes that the corruption was on our end,
> and Athena throws the same thing:
> {code}
> HIVE_CURSOR_ERROR: Can not read value at 28857 in block 0 in file
> {code}
> The integer value isn't particularly interesting; it's a 0.
> The Parquet write command we used in Spark is not particularly interesting:
> {code}
> data.repartition(((data.count() / 10000000) + 1).toInt)
>   .write.format("parquet")
>   .mode("append")
>   .partitionBy(partitionColumns: _*)
>   .save(path)
> {code}
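> For what it's worth, a minimal way to force every value of the affected column to
> be decoded (a hedged sketch, assuming the same Spark session and the column name
> above) is simply to read the output back and walk it:
> {code}
> // Forces a full scan of the array column; on a bad file this should surface the
> // same "Reading past RLE/BitPacking stream" error shown above.
> spark.read.parquet(path)
>   .select("playerpositions_ai")
>   .rdd
>   .foreach(_ => ())
> {code}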
> Our vendor has not yet been able to move our libraries to Parquet 1.9. I believe
> that if this issue is related to PARQUET-511 it should be resolved by our vendor,
> but I'm seeking clarification on whether that is in fact the case.
> My version of parquet-tools on my desktop (commands sketched below):
> * can totally dump the contents of that column without error
> * is on parquet 1.9
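> (Roughly, those checks were just the stock parquet-tools commands run against a
> locally downloaded copy of the file; the file name below is hypothetical:)
> {code}
> # print the file schema and footer metadata
> parquet-tools schema part-00000.parquet
> parquet-tools meta part-00000.parquet
> # dump the contents of the affected column
> parquet-tools dump -c playerpositions_ai part-00000.parquet
> {code}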
> At this point I'm stumped, and I believe this to be a bug somewhere.
> If this is a duplicate of PARQUET-511, cool, but if Hive, Presto, and Spark are
> all struggling to read this file written out by Spark, I'm inclined to believe
> the bug is in either Spark or the Parquet library itself.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)