[ 
https://issues.apache.org/jira/browse/PARQUET-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500125#comment-16500125
 ] 

ASF GitHub Bot commented on PARQUET-1317:
-----------------------------------------

nandorKollar commented on a change in pull request #489: PARQUET-1317: Fix 
ParquetMetadataConverter throw NPE
URL: https://github.com/apache/parquet-mr/pull/489#discussion_r192676768
 
 

 ##########
 File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
 ##########
 @@ -1170,9 +1170,11 @@ private void buildChildren(Types.GroupBuilder builder,
       }
       if (schemaElement.isSetConverted_type()) {
         LogicalTypeAnnotation originalType = 
getOriginalType(schemaElement.converted_type, schemaElement);
-        LogicalTypeAnnotation newLogicalType = 
getOriginalType(schemaElement.logicalType);
-        if (!originalType.equals(newLogicalType)) {
-          childBuilder.as(getOriginalType(schemaElement.converted_type, 
schemaElement));
+        if (schemaElement.isSetLogicalType()) {
 
 Review comment:
   I'm afraid that this won't work for old files. Those Parquet files which 
were written before the new logical types API was introduced don't have the 
[new logical 
types](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L390)
 field in the metadata, therefore isSetLogicalType will return false, and the 
converter won't set any logical type for old files. The goal here was:
   - first 
[read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1168)
 new logical type if present (not present for old files)
   - the 
[read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1171)
 converted type (this should be present for both old and new files)
   - convert the original type to the new parquet-mr logical type 
representation (converted_type in metadata), then convert the new logical type 
(logicalType in metadata) to the new parquet-mr logical type representation, 
compare them, and if they mismatch, then converted_type "wins": the annotation 
in converted_type metadata field should be used. This should work if both 
logicalType and converted_type is set in schema (after calling getOriginalType 
they should map to the same), or if logicalType is null (for older files)
   
   With this change the `childBuilder.as(originalType)` is called only when 
both fields are written, and the case where `logicalType` is null but 
`converted_type` is not is not handled. The tests are green, because they 
read/write files using the new API, which sets both fields: unfortunately there 
is a missing test case for testing backward compatibility. How about something 
like this instead:
   ```
         if (schemaElement.isSetConverted_type()) {
           LogicalTypeAnnotation originalType = 
getOriginalType(schemaElement.converted_type, schemaElement);
           LogicalTypeAnnotation newLogicalType = 
schemaElement.isSetLogicalType() ? getOriginalType(schemaElement.logicalType) : 
null;
           if (!originalType.equals(newLogicalType)) {
             childBuilder.as(getOriginalType(schemaElement.converted_type, 
schemaElement));
           }
         }
   ```
   Also, it would be nice to have a test case for this too, if you've time 
could you please add a unit test to verify backward compatibility? In case you 
don't, then let's get this committed, and I can add a relevant test in a 
followup PR.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> ParquetMetadataConverter throw NPE
> ----------------------------------
>
>                 Key: PARQUET-1317
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1317
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.10.1
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>
> How to reproduce:
> {code:scala}
> $ bin/spark-shell 
> scala> spark.range(10).selectExpr("cast(id as string) as 
> id").coalesce(1).write.parquet("/tmp/parquet-1317")
> scala> 
> java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar head 
> --debug 
> file:///tmp/parquet-1317/part-00000-6cfafbdd-fdeb-4861-8499-8583852ba437-c000.snappy.parquet
> {code}
> {noformat}
> java.io.IOException: Could not read footer: java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:271)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:202)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooters(ParquetFileReader.java:354)
> at 
> org.apache.parquet.tools.command.RowCountCommand.execute(RowCountCommand.java:88)
> at org.apache.parquet.tools.Main.main(Main.java:223)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.getOriginalType(ParquetMetadataConverter.java:828)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.buildChildren(ParquetMetadataConverter.java:1173)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetSchema(ParquetMetadataConverter.java:1124)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1058)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1052)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:257)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> java.io.IOException: Could not read footer: 
> java.lang.NullPointerException{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to