[ https://issues.apache.org/jira/browse/PARQUET-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Costas Piliotis updated PARQUET-1157:
-------------------------------------
    Environment:
parquet-avro
spark 2.1
hive 1.2
hive 2.1.0
presto 0.157
presto 0.180

  was:
parquet-avro
spark 2.1
hive 1.2
hive 2.1.1
presto 0.157
presto 0.180


> Parquet Write bug - parquet data unreadable by hive or presto or spark 2.1
> --------------------------------------------------------------------------
>
> Key: PARQUET-1157
> URL: https://issues.apache.org/jira/browse/PARQUET-1157
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.8.1
> Environment: parquet-avro
> spark 2.1
> hive 1.2
> hive 2.1.0
> presto 0.157
> presto 0.180
> Reporter: Costas Piliotis
> Attachments: log_106898428_1510201521.txt20171109-25172-1jt8dp2
>
>
> In our paradigm, a mapreduce job outputs parquet data to s3, and a spark job then consolidates these files from our staging area into target tables, adding partitions and modifying tables as need be.
> We have implemented and are using parquet schema evolution.
> The data written by our mapreduce task shows the following metadata for this column (written as parquet-avro):
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated int32 array;
> }
> {code}
> However, when spark writes it out it is converted. We have tried the legacy parquet format both on and off.
> With the legacy format off:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
> {code}
> and with the legacy format on:
> {code}
> optional group playerpositions_ai (LIST) {
>   repeated group bag {
>     optional int32 array;
>   }
> }
> {code}
> From what I've been reading in the spec, the latter seems valid.
> Sporadically we see some array columns produce odd failures on read in this parquet format:
> {code}
> Query 20171108_224243_00083_ec9ww failed:
> com.facebook.presto.spi.PrestoException
> Can not read value at 28857 in block 0 in file s3://.....
> com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.advanceNextPosition(ParquetHiveRecordCursor.java:232)
> com.facebook.presto.hive.HiveCoercionRecordCursor.advanceNextPosition(HiveCoercionRecordCursor.java:98)
> com.facebook.presto.hive.HiveRecordCursor.advanceNextPosition(HiveRecordCursor.java:179)
> com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:99)
> com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:247)
> com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
> com.facebook.presto.operator.Driver.processFor(Driver.java:301)
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
> com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:534)
> com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:670)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> And in spark, reading this file:
> {code}
> java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
>   at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
>   at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
>   at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
>   at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
>   at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
>   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm hopeful that this bug is related to PARQUET-511 and has already been fixed.
> For giggles I also took this parquet data and loaded it into Amazon Athena (which is basically presto anyway) in hopes that it was corruption on our end, but Athena throws the same thing:
> {code}
> HIVE_CURSOR_ERROR: Can not read value at 28857 in block 0 in file
> {code}
> The integer value isn't particularly interesting; it's a 0.
> The parquet write command we used in spark is not particularly interesting either:
> {code}
> data.repartition(((data.count() / 10000000) + 1).toInt)
>   .write.format("parquet")
>   .mode("append")
>   .partitionBy(partitionColumns: _*)
>   .save(path)
> {code}
> Our vendor has not yet been able to move our libraries to parquet 1.9. I believe that if this issue is related to PARQUET-511 it should be resolved by our vendor upgrading, but I'm seeking clarification on whether this is in fact the case.
> My version of parquet-tools on my desktop:
> * can dump the contents of that column without error
> * is on parquet 1.9
> At this point I'm stumped and I believe this to be a bug somewhere.
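> In case it helps reproduce this: the legacy on/off toggle mentioned above is the spark setting spark.sql.parquet.writeLegacyFormat. A minimal sketch of how we flip it around the write shown earlier (assuming a SparkSession named spark; this is not our exact job):
> {code}
> // Sketch: toggling Spark's parquet list layout before the write.
> // true  -> repeated group bag  { optional int32 array; }   (legacy)
> // false -> repeated group list { optional int32 element; } (standard, the default)
> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
>
> data.repartition(((data.count() / 10000000) + 1).toInt)
>   .write.format("parquet")
>   .mode("append")
>   .partitionBy(partitionColumns: _*)
>   .save(path)
> {code}
> Running parquet-tools meta on one of the output files shows which of the two layouts was actually written.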
> If this is a duplicate of PARQUET-511, cool, but if hive, presto, and spark are all struggling to read this file written out by spark, I'm inclined to believe the problem is in either spark or the parquet library itself.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)