[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Constantin Muraru (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985178#comment-15985178 ]

Constantin Muraru commented on PARQUET-964:
---

Will do, Julien. Once I polish it up, I'll open a PR.

> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> ----------------------------------------------------------------------
>
> Key: PARQUET-964
> URL: https://issues.apache.org/jira/browse/PARQUET-964
> Project: Parquet
> Issue Type: Bug
> Reporter: Constantin Muraru
> Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java, 
> parquet_totalValueCount.png
>
>
> Hi folks!
> We're working on making ProtoParquet work with Hive / AWS Athena (Presto) 
> \[1\]. The problem appears whenever we declare a repeated field (array) or 
> a map in the protobuf schema and then convert it to parquet. The conversion 
> itself works fine, but when we try to query the data with Hive/Presto, we 
> get some freaky errors.
> We've noticed, though, that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)!
> Comparing the parquet schemas generated from protobuf vs. avro, we've 
> noticed a few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
>   string top_field = 1;
>   repeated MyInnerMessage first_array = 2;
> }
>
> message MyInnerMessage {
>   int32 inner_field = 1;
>   repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
>     optional int32 inner_field;
>     repeated int32 second_array;
>   }
> }
> {code}
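> For reference, writing a record through the stock ProtoParquetWriter looks 
> roughly like the sketch below (a sketch only; the output path and the use 
> of the generated classes are placeholders for this example):
> {code}
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.proto.ProtoParquetWriter;
>
> // Sketch: write one ListOfList record with the unmodified writer.
> ProtoParquetWriter<ListOfList> writer =
>     new ProtoParquetWriter<>(new Path("/tmp/list_of_list.parquet"), ListOfList.class);
> writer.write(ListOfList.newBuilder()
>     .setTopField("top_field")
>     .addFirstArray(MyInnerMessage.newBuilder()
>         .setInnerField(1)
>         .addSecondArray(2))
>     .build());
> writer.close();
> {code}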
> When we try to query this data, we get parsing errors from Hive/Athena, 
> all related to the array/map fields.
> However, if we create a similar avro schema, AvroParquetWriter produces 
> the following parquet schema:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
>     repeated group array {
>       required int32 inner_field;
>       required group second_array (LIST) {
>         repeated int32 array;
>       }
>     }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck 
> with protobuf :-).
> You can see the additional wrappers that the protobuf version is missing: 
> {{required group first_array (LIST)}}.
> Our goal is to make ProtoParquetWriter generate a parquet schema similar 
> to what Avro produces: we basically want to add these wrappers around 
> lists/maps.
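> As an illustration of the target shape, the wrapped schema can be spelled 
> out with parquet-mr's {{Types}} builder (a sketch of the layout only, not 
> the actual converter code):
> {code}
> import org.apache.parquet.schema.MessageType;
> import org.apache.parquet.schema.OriginalType;
> import org.apache.parquet.schema.Types;
> import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.*;
>
> // Sketch: the Avro-style schema with (LIST) wrapper groups, built by hand.
> MessageType schema = Types.buildMessage()
>     .optional(BINARY).as(OriginalType.UTF8).named("top_field")
>     .requiredGroup().as(OriginalType.LIST)
>       .repeatedGroup()
>         .optional(INT32).named("inner_field")
>         .requiredGroup().as(OriginalType.LIST)
>           .repeated(INT32).named("array")
>         .named("second_array")
>       .named("array")
>     .named("first_array")
>     .named("TestProtobuf.ListOfList");
> {code}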
> Everything seemed to work great, until we bumped into an issue. We had 
> tuned ProtoParquetWriter to generate the same parquet schema as 
> AvroParquetWriter. However, one difference between protobuf and avro is 
> that in protobuf we can have a bunch of optional fields.
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
>     repeated group array {
>       optional int32 inner_field;
>       required group second_array (LIST) {
>         repeated int32 array;
>       }
>     }
>   }
> }
> {code}
> Notice the *optional* int32 inner_field (for avro it was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated but second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/test23.parquet
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>   at org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>   at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: org.apache.parquet.io.ParquetDecodingException: totalValueCount '0' <= 0
>   at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:349)
>   at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82)
>   ...
> {noformat}

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Julien Le Dem (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985131#comment-15985131 ]

Julien Le Dem commented on PARQUET-964:
---

Thanks for getting to the bottom of it.
Let us know when your project is working.

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Julien Le Dem (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985126#comment-15985126 ]

Julien Le Dem commented on PARQUET-964:
---

Nice. 
I had made this ValidatingRecordConsumer to catch those:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/ValidatingRecordConsumer.java
It is turned off by default because it is relatively expensive.
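For anyone who wants it on, validation can be enabled when constructing the 
writer (a sketch, assuming the ProtoParquetWriter constructor that exposes 
the {{validating}} flag; the path and sizes below are placeholders/defaults):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;

// Sketch: 'validating' = true wraps writes in ValidatingRecordConsumer.
ProtoParquetWriter<ListOfList> writer = new ProtoParquetWriter<>(
    new Path("/tmp/list_of_list.parquet"),
    ListOfList.class,
    CompressionCodecName.UNCOMPRESSED,
    ParquetWriter.DEFAULT_BLOCK_SIZE,
    ParquetWriter.DEFAULT_PAGE_SIZE,
    true,   // enableDictionary
    true);  // validating
{code}
In a MapReduce job the same switch is the {{parquet.validation}} property 
(ParquetOutputFormat.setValidation).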

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Constantin Muraru (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984855#comment-15984855 ]

Constantin Muraru commented on PARQUET-964:
---

Oops, there was a problem with my updated ProtoParquetWriter and the way it 
was calling recordConsumer.
For the above-mentioned schema, these were the calls that were made. Marked 
in bold are the calls that were missing, which caused the 
ParquetDecodingException.

recordConsumer.startMessage()
recordConsumer.startField(fieldName=top_field, field=0)
recordConsumer.endField(fieldName=top_field, field=0)
recordConsumer.startField(fieldName=first_array, field=1)
recordConsumer.startGroup()
recordConsumer.startField(fieldName=array, field=0)
*recordConsumer.startGroup()*
recordConsumer.startField(fieldName=second_array, field=1)
recordConsumer.startGroup()
recordConsumer.startField(fieldName=array, field=0)
recordConsumer.endField(fieldName=array, field=0)
recordConsumer.endGroup()
recordConsumer.endField(fieldName=second_array, field=1)
*recordConsumer.endGroup()*
recordConsumer.endField(fieldName=array, field=0)
recordConsumer.endGroup()
recordConsumer.endField(fieldName=first_array, field=1)
recordConsumer.endMessage()

After adding the missing startGroup/endGroup calls, the problem got fixed!
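
In Java form, the corrected sequence for one record looks roughly like the 
sketch below. The value-writing calls (addBinary/addInteger), which the list 
above elides, are filled in; the helper method and its arguments are just 
for illustration:

{code}
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;

// Sketch: one ListOfList record with inner_field deliberately unset.
void writeRecord(RecordConsumer rc, String topField, int[] secondArray) {
  rc.startMessage();
  rc.startField("top_field", 0);
  rc.addBinary(Binary.fromString(topField));
  rc.endField("top_field", 0);
  rc.startField("first_array", 1);
  rc.startGroup();                      // the first_array (LIST) wrapper
  rc.startField("array", 0);
  rc.startGroup();                      // one list element (was missing)
  // inner_field is optional and unset, so nothing is written for it
  rc.startField("second_array", 1);
  rc.startGroup();                      // the second_array (LIST) wrapper
  rc.startField("array", 0);
  for (int value : secondArray) {
    rc.addInteger(value);
  }
  rc.endField("array", 0);
  rc.endGroup();
  rc.endField("second_array", 1);
  rc.endGroup();                        // close the element group (was missing)
  rc.endField("array", 0);
  rc.endGroup();
  rc.endField("first_array", 1);
  rc.endMessage();
}
{code}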

We can close this ticket.

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-25 Thread Constantin Muraru (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983884#comment-15983884 ]

Constantin Muraru commented on PARQUET-964:
---

Thanks [~julienledem]! I'll create a test on my branch which exposes the 
issue more clearly, and maybe we can discuss based on that.

As a hint: given the parquet schema below, the totalValueCount is 0 for the 
column with columnReaderImpl.path = ColumnDescriptor {{[first_array, array, 
inner_field] INT32}} whenever that field is never populated in the data.

{code}
message TestProtobuf.ListOfList {
  optional binary top_field (UTF8);
  required group first_array (LIST) {
repeated group array {
  optional int32 inner_field;
  required group second_array (LIST) {
repeated int32 array;
  }
}
  }
}
{code}

{code}
ListOfList message = ListOfList.newBuilder()
    .setTopField("top_field")
    .addFirstArray(ListOfListOuterClass.MyInnerMessage.newBuilder()
        .addSecondArray(2))  // inner_field missing here
    .build();
{code}

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-25 Thread Julien Le Dem (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983784#comment-15983784 ]

Julien Le Dem commented on PARQUET-964:
---

totalValueCount includes null values, so it should never be 0 unless you're 
creating empty parquet files.
Separately, it should also not be negative (that would indicate an overflow, 
since the underlying metadata stores a long).
Could you look into why totalValueCount == 0? It should be the sum of the 
value counts of all pages in that column chunk.
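
For reference, a quick way to check those footer counts (a sketch; the file 
path is a placeholder):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Sketch: dump the per-column-chunk value counts stored in the footer.
ParquetMetadata footer =
    ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/test23.parquet"));
for (BlockMetaData block : footer.getBlocks()) {
  for (ColumnChunkMetaData column : block.getColumns()) {
    System.out.println(column.getPath() + " -> " + column.getValueCount());
  }
}
{code}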

