[
https://issues.apache.org/jira/browse/PARQUET-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500125#comment-16500125
]
ASF GitHub Bot commented on PARQUET-1317:
-----------------------------------------
nandorKollar commented on a change in pull request #489: PARQUET-1317: Fix ParquetMetadataConverter throw NPE
URL: https://github.com/apache/parquet-mr/pull/489#discussion_r192676768
##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##########
@@ -1170,9 +1170,11 @@ private void buildChildren(Types.GroupBuilder builder,
       }
       if (schemaElement.isSetConverted_type()) {
         LogicalTypeAnnotation originalType = getOriginalType(schemaElement.converted_type, schemaElement);
-        LogicalTypeAnnotation newLogicalType = getOriginalType(schemaElement.logicalType);
-        if (!originalType.equals(newLogicalType)) {
-          childBuilder.as(getOriginalType(schemaElement.converted_type, schemaElement));
+        if (schemaElement.isSetLogicalType()) {
Review comment:
I'm afraid this won't work for old files. Parquet files written before the new logical types API was introduced don't have the [new logical types](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L390) field in the metadata, so isSetLogicalType will return false and the converter won't set any logical type for those files. The goal here was:
- first [read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1168) the new logical type, if present (it is not present for old files)
- then [read](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1171) the converted type (which should be present for both old and new files)
- convert the converted type (`converted_type` in the metadata) to the new parquet-mr logical type representation, convert the new logical type (`logicalType` in the metadata) to the same representation, compare the two, and if they mismatch, `converted_type` "wins": the annotation from the `converted_type` metadata field should be used. This works when both `logicalType` and `converted_type` are set in the schema (after calling getOriginalType they should map to the same annotation), and it should also work when `logicalType` is null (for older files).

With this change `childBuilder.as(originalType)` is called only when both fields are written, so the case where `logicalType` is null but `converted_type` is set is not handled. The tests are green because they read and write files with the new API, which sets both fields; unfortunately, a test case covering backward compatibility is missing. How about something like this instead:
```
if (schemaElement.isSetConverted_type()) {
  LogicalTypeAnnotation originalType = getOriginalType(schemaElement.converted_type, schemaElement);
  LogicalTypeAnnotation newLogicalType =
      schemaElement.isSetLogicalType() ? getOriginalType(schemaElement.logicalType) : null;
  if (!originalType.equals(newLogicalType)) {
    childBuilder.as(getOriginalType(schemaElement.converted_type, schemaElement));
  }
}
```
Also, it would be nice to have a test case for this. If you have time, could you please add a unit test verifying backward compatibility (something along the lines of the sketch below)? If you don't, then let's get this committed, and I can add a relevant test in a follow-up PR.
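For illustration only, here is a rough sketch of what such a backward-compatibility test could look like: it builds footer metadata the way an old writer would (`converted_type` set, no `logicalType`) and checks that the annotation still survives the conversion. The class name, the null column orders, and the use of the package-private fromParquetSchema(List&lt;SchemaElement&gt;, List&lt;ColumnOrder&gt;) overload are my assumptions, not part of this PR, so adjust to the existing converter tests as needed:
```
// Hypothetical sketch of a backward-compatibility test (not part of this PR):
// an "old" footer has converted_type but no logicalType, and the converter
// must still apply the annotation.
package org.apache.parquet.format.converter;

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.apache.parquet.format.ConvertedType;
import org.apache.parquet.format.FieldRepetitionType;
import org.apache.parquet.format.SchemaElement;
import org.apache.parquet.format.Type;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.junit.Test;

public class TestConvertedTypeBackwardCompatibility {

  @Test
  public void testConvertedTypeWithoutLogicalTypeIsStillAnnotated() {
    // Root element of the message, followed by a single optional binary column.
    SchemaElement root = new SchemaElement("root");
    root.setNum_children(1);

    SchemaElement id = new SchemaElement("id");
    id.setType(Type.BYTE_ARRAY);
    id.setRepetition_type(FieldRepetitionType.OPTIONAL);
    id.setConverted_type(ConvertedType.UTF8);
    // Deliberately no setLogicalType(...): isSetLogicalType() returns false,
    // as for files written before the new logical types API.

    List<SchemaElement> footerSchema = Arrays.asList(root, id);

    // Assumes the package-private fromParquetSchema(schema, columnOrders)
    // overload; old footers carry no column orders, hence null.
    MessageType schema = new ParquetMetadataConverter().fromParquetSchema(footerSchema, null);

    // The UTF8 annotation coming from converted_type must survive.
    assertEquals(OriginalType.UTF8, schema.getType("id").getOriginalType());
  }
}
```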
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> ParquetMetadataConverter throw NPE
> ----------------------------------
>
> Key: PARQUET-1317
> URL: https://issues.apache.org/jira/browse/PARQUET-1317
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.10.1
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:scala}
> $ bin/spark-shell
> scala> spark.range(10).selectExpr("cast(id as string) as id").coalesce(1).write.parquet("/tmp/parquet-1317")
> scala>
> java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar head --debug file:///tmp/parquet-1317/part-00000-6cfafbdd-fdeb-4861-8499-8583852ba437-c000.snappy.parquet
> {code}
> {noformat}
> java.io.IOException: Could not read footer: java.lang.NullPointerException
>     at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:271)
>     at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:202)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooters(ParquetFileReader.java:354)
>     at org.apache.parquet.tools.command.RowCountCommand.execute(RowCountCommand.java:88)
>     at org.apache.parquet.tools.Main.main(Main.java:223)
> Caused by: java.lang.NullPointerException
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.getOriginalType(ParquetMetadataConverter.java:828)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.buildChildren(ParquetMetadataConverter.java:1173)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetSchema(ParquetMetadataConverter.java:1124)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1058)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1052)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
>     at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261)
>     at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:257)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> java.io.IOException: Could not read footer: java.lang.NullPointerException
> {noformat}