[ https://issues.apache.org/jira/browse/PARQUET-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500125#comment-16500125 ]
ASF GitHub Bot commented on PARQUET-1317: ----------------------------------------- nandorKollar commented on a change in pull request #489: PARQUET-1317: Fix ParquetMetadataConverter throw NPE URL: https://github.com/apache/parquet-mr/pull/489#discussion_r192676768 ########## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ########## @@ -1170,9 +1170,11 @@ private void buildChildren(Types.GroupBuilder builder, } if (schemaElement.isSetConverted_type()) { LogicalTypeAnnotation originalType = getOriginalType(schemaElement.converted_type, schemaElement); - LogicalTypeAnnotation newLogicalType = getOriginalType(schemaElement.logicalType); - if (!originalType.equals(newLogicalType)) { - childBuilder.as(getOriginalType(schemaElement.converted_type, schemaElement)); + if (schemaElement.isSetLogicalType()) { Review comment: I'm afraid that this won't work for old files. Those Parquet files which were written before the new logical types API was introduced don't have the [new logical types](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L390) field in the metadata, therefore isSetLogicalType will return false, and the converter won't set any logical type for old files. The goal here was: - first [read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1168) new logical type if present (not present for old files) - the [read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1171) converted type (this should be present for both old and new files) - convert the original type to the new parquet-mr logical type representation (converted_type in metadata), then convert the new logical type (logicalType in metadata) to the new parquet-mr logical type representation, compare them, and if they mismatch, then converted_type "wins": the annotation in converted_type metadata field should be used. This should work if both logicalType and converted_type is set in schema (after calling getOriginalType they should map to the same), or if logicalType is null (for older files) With this change the `childBuilder.as(originalType)` is called only when both fields are written, and the case where `logicalType` is null but `converted_type` is not is not handled. The tests are green, because they read/write files using the new API, which sets both fields: unfortunately there is a missing test case for testing backward compatibility. How about something like this instead: ``` if (schemaElement.isSetConverted_type()) { LogicalTypeAnnotation originalType = getOriginalType(schemaElement.converted_type, schemaElement); LogicalTypeAnnotation newLogicalType = schemaElement.isSetLogicalType() ? getOriginalType(schemaElement.logicalType) : null; if (!originalType.equals(newLogicalType)) { childBuilder.as(getOriginalType(schemaElement.converted_type, schemaElement)); } } ``` Also, it would be nice to have a test case for this too, if you've time could you please add a unit test to verify backward compatibility? In case you don't, then let's get this committed, and I can add a relevant test in a followup PR. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > ParquetMetadataConverter throw NPE > ---------------------------------- > > Key: PARQUET-1317 > URL: https://issues.apache.org/jira/browse/PARQUET-1317 > Project: Parquet > Issue Type: Bug > Affects Versions: 1.10.1 > Reporter: Yuming Wang > Assignee: Yuming Wang > Priority: Major > > How to reproduce: > {code:scala} > $ bin/spark-shell > scala> spark.range(10).selectExpr("cast(id as string) as > id").coalesce(1).write.parquet("/tmp/parquet-1317") > scala> > java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar head > --debug > file:///tmp/parquet-1317/part-00000-6cfafbdd-fdeb-4861-8499-8583852ba437-c000.snappy.parquet > {code} > {noformat} > java.io.IOException: Could not read footer: java.lang.NullPointerException > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:271) > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:202) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooters(ParquetFileReader.java:354) > at > org.apache.parquet.tools.command.RowCountCommand.execute(RowCountCommand.java:88) > at org.apache.parquet.tools.Main.main(Main.java:223) > Caused by: java.lang.NullPointerException > at > org.apache.parquet.format.converter.ParquetMetadataConverter.getOriginalType(ParquetMetadataConverter.java:828) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.buildChildren(ParquetMetadataConverter.java:1173) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetSchema(ParquetMetadataConverter.java:1124) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1058) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1052) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476) > at > org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261) > at > org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:257) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > java.io.IOException: Could not read footer: > java.lang.NullPointerException{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)