[
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601491#comment-17601491
]
J Y commented on PARQUET-2181:
------------------------------
[~theosib-amazon], i think they're similar in that there are schema issues w/ the
avro conversion. i believe the root cause is that using avro internally to
read parquet files loses expressiveness, so you get avro schema validation or
mismatched schema issues as a consequence.
specs-wise, you have to work around the limitations of avro's dsl to truly
capture the parquet schema properly. i believe people typically don't hit this
since the typical open source pattern is to start w/ an avro schema as the
basis. people who don't start from avro will run into problems.
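to make the mismatch concrete, here's a minimal java sketch. it is a toy model, not parquet-mr code: `Node`, `Primitive`, `Group`, `asGroup` and `describeElement` are made-up names, and the `elementIsRecord` flag just stands in for a reader-side schema that (by assumption) wraps each list element in a record. it builds the `index` list from the issue's parquet schema and shows how expecting a group at the element position produces the same message as the stack trace.

```java
import java.util.Collections;
import java.util.List;

public class ListElementMismatch {
    /** Toy stand-in for a Parquet schema node; NOT the parquet-mr Type class. */
    static abstract class Node {
        final String repetition;
        final String name;

        Node(String repetition, String name) {
            this.repetition = repetition;
            this.name = name;
        }

        abstract String describe();
    }

    static final class Primitive extends Node {
        final String type;

        Primitive(String repetition, String type, String name) {
            super(repetition, name);
            this.type = type;
        }

        @Override
        String describe() {
            return repetition + " " + type + " " + name;
        }
    }

    static final class Group extends Node {
        final List<Node> fields;

        Group(String repetition, String name, List<Node> fields) {
            super(repetition, name);
            this.fields = fields;
        }

        @Override
        String describe() {
            return repetition + " group " + name;
        }
    }

    /** Same kind of check that yields "... is not a group" in the trace. */
    static Group asGroup(Node node) {
        if (node instanceof Group) {
            return (Group) node;
        }
        throw new ClassCastException(node.describe() + " is not a group");
    }

    /** The three-level list from the issue: index (LIST) -> repeated group list -> element. */
    static Group indexList() {
        Node element = new Primitive("required", "int32", "element");
        Node repeated = new Group("repeated", "list", Collections.singletonList(element));
        return new Group("optional", "index", Collections.singletonList(repeated));
    }

    /**
     * Hypothetical reader. elementIsRecord is what the reader-side schema claims
     * about the list element (assumption: a record wrapper instead of a plain int).
     */
    static String describeElement(Group list, boolean elementIsRecord) {
        Group repeated = asGroup(list.fields.get(0));
        Node element = repeated.fields.get(0);
        return elementIsRecord ? asGroup(element).describe() : element.describe();
    }

    public static void main(String[] args) {
        // Reader agrees with the file: the element is a primitive.
        System.out.println(describeElement(indexList(), false));
        // Reader expects a record at the element position: the cast fails.
        try {
            describeElement(indexList(), true);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

running `main` prints `required int32 element` for the primitive reading, then `ClassCastException: required int32 element is not a group` for the record reading.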
> parquet-cli fails at supporting parquet-protobuf generated files
> ----------------------------------------------------------------
>
> Key: PARQUET-2181
> URL: https://issues.apache.org/jira/browse/PARQUET-2181
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Reporter: J Y
> Priority: Critical
> Attachments: sample-depth-1.tgz, samples.tgz
>
>
> i generated a parquet file using a protobuf with this proto definition:
> {code:java}
> message IndexPath {
>   // Index of item in path.
>   repeated int32 index = 1;
> }
> message SomeEvent {
>   // truncated/obfuscated wrapper
>   optional IndexPath client_position = 1;
> }
> {code}
> this gets translated to the following parquet schema using the new compliant
> schema for lists:
> {code:java}
> message SomeEvent {
>   optional group client_position = 1 {
>     optional group index (LIST) = 1 {
>       repeated group list {
>         required int32 element;
>       }
>     }
>   }
> }
> {code}
> this causes parquet-cli cat to barf on a file containing these events:
> {quote}java.lang.RuntimeException: Failed on record 0
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
> at org.apache.parquet.cli.Main.run(Main.java:157)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.parquet.cli.Main.main(Main.java:187)
> Caused by: java.lang.ClassCastException: required int32 element is not a group
> at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
> at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
> at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
> at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
> at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
> at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
> at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
> ... 3 more
> {quote}
> using the old parquet-tools binary to cat this file works fine.