prodeezy opened a new issue #2962:
URL: https://github.com/apache/iceberg/issues/2962


   As part of https://github.com/apache/iceberg/issues/1441, the Parquet version was updated from 1.11.0 to 1.11.1. After rebasing our internal build on the latest changes from master, we found that certain fields written with Iceberg using Parquet v1.11.0 are not readable with Iceberg built against Parquet v1.11.1.
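   A minimal sketch of the kind of read that hits this for us (the table identifier `db.events` is a placeholder; the table was written with the Parquet 1.11.0 build and read with the 1.11.1 build):
   ```scala
   // Hypothetical repro sketch: read a table written by an Iceberg build on
   // Parquet 1.11.0 with an Iceberg build on Parquet 1.11.1.
   // "db.events" is a placeholder table identifier.
   spark.read
     .format("iceberg")
     .load("db.events")
     .select("segmentMembership")
     .show()
   ```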
   
   **Error:**
   ```
   java.lang.IllegalArgumentException: [segmentMembership, map, key] required binary key (STRING) = 9 is not in the store: [[identityMap, map, key] required binary key (STRING) = 3, [identityMap, map, value, list, element, id] optional binary id (STRING) = 7, [identityMap, map, value, list, element, authenticatedState] optional binary authenticatedState (STRING) = 6, [identityMap, map, value, list, element, primary] optional boolean primary = 8] 4
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:231)
        at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
        at org.apache.iceberg.parquet.ParquetValueReaders$RepeatedKeyValueReader.setPageSource(ParquetValueReaders.java:529)
        at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)
        at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:142)
        at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:112)
        at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
        at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
        at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
   
   
   
   The field in question has this form:
   
   ```
    |-- segmentMembership: map (nullable = true)
    |    |-- key: string
    |    |-- value: map (valueContainsNull = true)
    |    |    |-- key: string
    |    |    |-- value: struct (valueContainsNull = true)
    |    |    |    |-- payload: struct (nullable = true)
    |    |    |    |    |-- boolA: boolean (nullable = true)
    |    |    |    |    |-- doubleValueA: double (nullable = true)
    |    |    |    |    |-- doubleValueB: double (nullable = true)
    |    |    |    |    |-- stringValueA: string (nullable = true)
    |    |    |    |    |-- stringValueB: string (nullable = true)
    |    |    |    |-- status: string (nullable = true)
    |    |    |    |-- lastQualificationTime: timestamp (nullable = true)
   ```
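   For clarity, the same nested type expressed with Spark's DataType API looks roughly like the sketch below (names are taken from the printed schema above; nullability flags are approximate):
   ```scala
   import org.apache.spark.sql.types._

   // Rough Spark equivalent of the segmentMembership field shown above.
   val payload = StructType(Seq(
     StructField("boolA", BooleanType),
     StructField("doubleValueA", DoubleType),
     StructField("doubleValueB", DoubleType),
     StructField("stringValueA", StringType),
     StructField("stringValueB", StringType)))

   val valueStruct = StructType(Seq(
     StructField("payload", payload),
     StructField("status", StringType),
     StructField("lastQualificationTime", TimestampType)))

   // segmentMembership: map<string, map<string, struct<...>>>
   val segmentMembershipType =
     MapType(StringType,
       MapType(StringType, valueStruct, valueContainsNull = true),
       valueContainsNull = true)
   ```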

