[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985126#comment-15985126 ]

Julien Le Dem commented on PARQUET-964:
---------------------------------------

Nice. 
I had made this ValidatingRecordConsumer to catch these kinds of problems at write time:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/ValidatingRecordConsumer.java
It is turned off by default because it is relatively expensive.
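For example, validation can be enabled through the writer constructor (a minimal sketch; the generated TestProtobuf.ListOfList class from the attached proto is an assumption):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;

public class ValidatingWriteSketch {
  public static void main(String[] args) throws IOException {
    // The last constructor flag enables ValidatingRecordConsumer, which
    // checks every written record against the declared file schema.
    try (ProtoParquetWriter<TestProtobuf.ListOfList> writer =
        new ProtoParquetWriter<>(
            new Path("/tmp/validated.parquet"),
            TestProtobuf.ListOfList.class,
            CompressionCodecName.SNAPPY,
            ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.DEFAULT_PAGE_SIZE,
            true,     // enableDictionary
            true)) {  // validating
      writer.write(TestProtobuf.ListOfList.getDefaultInstance());
    }
  }
}
{code}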

> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> ---------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-964
>                 URL: https://issues.apache.org/jira/browse/PARQUET-964
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Constantin Muraru
>         Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java, 
> parquet_totalValueCount.png
>
>
> Hi folks!
> We're working on adding support for ProtoParquet to work with Hive / AWS 
> Athena (Presto) \[1\]. The problem we've encountered appears whenever we 
> declare a repeated field (array) or a map in the protobuf schema and we then 
> try to convert it to parquet. The conversion works fine, but when we try to 
> query the data with Hive/Presto, we get some freaky errors.
> We've noticed though that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)! 
> Comparing the parquet schema generated by protobuf vs avro, we've noticed a 
> few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
>     string top_field = 1;
>     repeated MyInnerMessage first_array = 2;
> }
> message MyInnerMessage {
>     int32 inner_field = 1;
>     repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
>     optional int32 inner_field;
>     repeated int32 second_array;
>   }
> }
> {code}
> When we try to query this data, we get parsing errors from Hive/Athena. The 
> parsing errors are related to the array/map fields.
> However, if we create a similar avro schema, the parquet result of the 
> AvroParquetWriter is the following:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
>     repeated group array {
>       required int32 inner_field;
>       required group second_array (LIST) {
>         repeated int32 array;
>       }
>     }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck with 
> protobuf :-).
> You can see the additional wrappers which are missing from protobuf: 
> {{required group first_array (LIST)}}.
> Our goal is to make the ProtoParquetWriter generate a parquet schema similar 
> to what Avro is doing. We basically want to add these wrappers around 
> lists/maps.
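> To illustrate the shape we're after, the Avro-style layout above could be 
> assembled with parquet-mr's {{org.apache.parquet.schema.Types}} builder (just 
> a sketch of the target schema, not our actual converter code):
> {code}
> import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
> import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
> 
> import org.apache.parquet.schema.MessageType;
> import org.apache.parquet.schema.OriginalType;
> import org.apache.parquet.schema.Types;
> 
> public class TargetSchemaSketch {
>     // Same layout as the Avro-generated schema above: each repeated field
>     // is wrapped in a LIST-annotated group with a repeated "array" level.
>     public static MessageType target() {
>         return Types.buildMessage()
>             .required(BINARY).as(OriginalType.UTF8).named("top_field")
>             .requiredGroup().as(OriginalType.LIST)
>                 .repeatedGroup()
>                     .required(INT32).named("inner_field")
>                     .requiredGroup().as(OriginalType.LIST)
>                         .repeated(INT32).named("array")
>                         .named("second_array")
>                     .named("array")
>                 .named("first_array")
>             .named("TestProtobuf.ListOfList");
>     }
> }
> {code}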
> Everything seemed to work great, until we bumped into an issue. We tuned 
> ProtoParquetWriter to generate the same parquet schema as AvroParquetWriter. 
> However, one difference between protobuf and avro is that in protobuf we can 
> have a bunch of optional fields.
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
>     repeated group array {
>       optional int32 inner_field;
>       required group second_array (LIST) {
>         repeated int32 array;
>       }
>     }
>   }
> }
> {code}
> Notice the *optional* int32 inner_field (for avro that was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated, but the second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>       at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>       at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>       at 
> org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>       at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: org.apache.parquet.io.ParquetDecodingException: totalValueCount 
> '0' <= 0
>       at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:349)
>       at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82)
>       at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:77)
>       at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:272)
>       at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:145)
>       at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:107)
>       at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:155)
>       at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:107)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
>       ... 9 more
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
> Process finished with exit code 1
> {noformat}
> Basically, this error occurs whenever {{first_array.inner_field}} is not 
> populated but {{first_array.second_array}} is.
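> For a concrete reproducer, a record like the one below triggers the error 
> (using the classes protoc generates from the attached ListOfList.proto; the 
> {{TestProtobuf}} outer class name is an assumption):
> {code}
> public class FailingRecordSketch {
>     // inner_field is deliberately left unset, while second_array is populated
>     public static TestProtobuf.ListOfList failingRecord() {
>         return TestProtobuf.ListOfList.newBuilder()
>             .setTopField("top_field")
>             .addFirstArray(TestProtobuf.MyInnerMessage.newBuilder()
>                 .addSecondArray(20))
>             .build();
>     }
> }
> {code}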
> I'm attaching the code used to generate the parquet files (though keep in 
> mind that we're working on a fork atm).
> Going through the code, I've noticed that the errors stop and everything 
> seems to work fine, once I change this condition in ColumnReaderImpl: 
> From:
> {code}
> if (totalValueCount <= 0) {
>       throw new ParquetDecodingException("totalValueCount '" + 
> totalValueCount + "' <= 0");
> }
> {code}
> To:
> {code}
> if (totalValueCount < 0) {
>       throw new ParquetDecodingException("totalValueCount '" + 
> totalValueCount + "' < 0");
> }
> {code}
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java#L355
> With this change in place, {{parquet-tools cat}} reads the file successfully:
> {noformat}
> parquet-tools cat /tmp/test24.parquet
> [main] INFO org.apache.parquet.hadoop.InternalParquetRecordReader - 
> RecordReader initialized will read a total of 10 records.
> [main] INFO org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. 
> reading next block
> [main] INFO org.apache.parquet.hadoop.InternalParquetRecordReader - block 
> read in memory in 27 ms. row count = 10
> top_field = top_field
> first_array:
> .array:
> ..second_array:
> ...array = 20
> top_field = top_field
> first_array:
> .array:
> ..second_array:
> ...array = 20
> {noformat}
> I am wondering what your thoughts are on this. Should we change this 
> condition to {{if (totalValueCount < 0)}}?
> Any feedback is greatly appreciated! Let me know if I missed any information.
> Thanks,
> Costi
> \[1\] https://aws.amazon.com/athena/


