James, I think this is a bug. Parquet flattens nested structures into a fixed set of columns, which only works for types that aren't recursive. Recursive types are not currently supported because they would be difficult to represent in a columnar format like Parquet. The bug is that the Protobuf support in Parquet (ProtoSchemaConverter in your stack trace) should detect the recursive type and reject it with an exception rather than recursing until the stack overflows.
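
To illustrate the kind of check I mean, here is a minimal sketch. It does not use the Protobuf or Parquet APIs: MessageType and SchemaFlattener are hypothetical stand-ins invented for this example. It lists one column per leaf field, and throws as soon as a message type shows up again on the path that is currently being expanded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the actual ProtoSchemaConverter logic. */
final class SchemaFlattener {

    /** Stand-in for a message descriptor (assumption, not a Protobuf class). */
    static final class MessageType {
        final String name;
        // Field name -> nested message type, or null for a primitive field.
        final Map<String, MessageType> fields = new LinkedHashMap<>();

        MessageType(String name) { this.name = name; }

        MessageType primitive(String field) { fields.put(field, null); return this; }

        MessageType nested(String field, MessageType type) { fields.put(field, type); return this; }
    }

    /** Returns one dotted path per leaf column; throws if the type is recursive. */
    static List<String> leafColumns(MessageType root) {
        List<String> columns = new ArrayList<>();
        walk(root, "", new LinkedHashSet<>(), columns);
        return columns;
    }

    private static void walk(MessageType type, String prefix,
                             Set<String> onPath, List<String> out) {
        // A type already on the current path would expand forever.
        if (!onPath.add(type.name)) {
            throw new IllegalArgumentException("Recursive message type: "
                    + String.join(" -> ", onPath) + " -> " + type.name);
        }
        for (Map.Entry<String, MessageType> field : type.fields.entrySet()) {
            String path = prefix.isEmpty() ? field.getKey() : prefix + "." + field.getKey();
            if (field.getValue() == null) {
                out.add(path);                       // primitive field: one column
            } else {
                walk(field.getValue(), path, onPath, out);
            }
        }
        // Backtrack so the same type may still appear on sibling branches.
        onPath.remove(type.name);
    }

    public static void main(String[] args) {
        MessageType b = new MessageType("B").primitive("id");
        MessageType a = new MessageType("A").primitive("id").nested("extension", b);
        System.out.println(leafColumns(a));          // prints [id, extension.id]

        b.nested("extension", b);                    // make B refer to itself
        try {
            leafColumns(a);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());      // prints Recursive message type: A -> B -> B
        }
    }
}
```

Tracking only the types on the current path, rather than every type seen so far, means a message that is reused in two sibling fields is still accepted; only a true cycle is rejected.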

Would you mind filing a bug for this in the issue tracker? It is here:

  https://issues.apache.org/jira/browse/PARQUET

Thanks!

rb

On 01/12/2016 11:05 AM, McCudden, James wrote:
I have a Protocol buffer defined as such:

message A
{
    optional string id = 1;
    repeated B extension = 2;
}

The B message is:

message B
{
    optional string id = 1;
    repeated B extension = 2;
}

The self-referencing message "B" causes an infinite recursion when trying
to write an object of type A to Parquet:

    public void writeMessages(Class<? extends Message> cls, Path file,
                              List<MessageOrBuilder> records) throws IOException {

        ParquetWriter<MessageOrBuilder> writer =
                new ProtoParquetWriter<MessageOrBuilder>(file, cls);

        // Write every record, closing the writer even if a write fails.
        try {
            for (MessageOrBuilder record : records) {
                writer.write(record);
            }
        } finally {
            writer.close();
        }
    }

Message objects without the self-referencing fields write without errors.  The
recursive loop occurs during the field discovery from the class.  Here is a
stack trace from a spark-shell run:
Exception in thread "main" java.lang.StackOverflowError
         at java.util.HashMap.inflateTable(HashMap.java:317)
         at java.util.HashMap.put(HashMap.java:488)
         at org.apache.parquet.schema.GroupType.<init>(GroupType.java:97)
         at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:624)
         at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:497)
         at org.apache.parquet.schema.Types$Builder.named(Types.java:286)
         at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:67)
         at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:98)
         at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:67)
         at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:98)
         at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:67)
         at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:98)
         at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:67)
<...  repeats until stack failure ...>

Is this a known issue, or is there some option to pass to the ProtoParquetWriter?  I
haven't seen anything obvious.

Thanks


James McCudden
Architect
Relay Health Intelligence

413.587.6819 Office
413.835.5441 Mobile

RelayHealth

A division of McKesson


--
Ryan Blue
Software Engineer
Cloudera, Inc.
