[jira] [Comment Edited] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

Timothy Miller (Jira) Fri, 13 May 2022 14:03:05 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536888#comment-17536888
 ]


Timothy Miller edited comment on PARQUET-2069 at 5/13/22 9:02 PM:
------------------------------------------------------------------

I managed to probe this a bit. I have no idea why the behavior here diverges 
from what I see with the simple reader program I wrote 
([https://github.com/theosib-amazon/parquet-mr-minreader]), even though both 
use the exact same API calls. Specifically, I'm trying to read the 
modified.parquet that's attached to this bug report.

The exception is thrown by 
org.apache.parquet.avro.AvroRecordConverter<T>.getAvroField(), which is 
trying to look up a field from the parquetSchema in the avroSchema. The caller 
is org.apache.parquet.avro.AvroRecordConverter<T>.AvroRecordConverter() (the 
constructor), which loops over parquetSchema.getFields().
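
To make the failure mode concrete, here is a minimal sketch of that lookup pattern using plain Java collections. It is not the parquet-mr source: FieldLookupSketch, lookupAvroField, and resolveAll are hypothetical names, IllegalStateException is a stand-in for whatever the real code throws, and the field names are the ones reported further down in this comment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the constructor loop; NOT the parquet-mr implementation. */
class FieldLookupSketch {

    /** Hypothetical analogue of getAvroField(): resolve a Parquet field name in an Avro record's field map. */
    static int lookupAvroField(String parquetFieldName, Map<String, Integer> avroFieldPositions) {
        Integer position = avroFieldPositions.get(parquetFieldName);
        if (position == null) {
            // Stand-in exception type; the real code may throw something else.
            throw new IllegalStateException("Parquet field '" + parquetFieldName
                    + "' not found among Avro fields " + avroFieldPositions.keySet());
        }
        return position;
    }

    /** Same shape as the constructor loop: one lookup per Parquet field. */
    static void resolveAll(List<String> parquetFieldNames, Map<String, Integer> avroFieldPositions) {
        for (String name : parquetFieldNames) {
            lookupAvroField(name, avroFieldPositions);
        }
    }

    public static void main(String[] args) {
        // Field names observed in the Avro Schema's fieldMap while debugging.
        Map<String, Integer> avroFieldPositions = new LinkedHashMap<>();
        avroFieldPositions.put("rows", 0);
        avroFieldPositions.put("destination_address", 1);
        avroFieldPositions.put("origin_address", 2);
        avroFieldPositions.put("status", 3);

        try {
            // The Parquet group being converted only has the field "elements".
            resolveAll(List.of("elements"), avroFieldPositions);
        } catch (IllegalStateException e) {
            System.out.println("Lookup failed: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the lookup fails as soon as the Parquet group and the Avro record describe different levels of the data, regardless of how each schema looks on its own.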

The Parquet schema it's trying to use is this:
{noformat}
optional group element {
  optional group elements (LIST) {
    repeated group array {
      optional group element {
        optional group distance {
          optional binary text (STRING);
          optional int64 value;
        }
        optional group duration {
          optional binary text (STRING);
          optional int64 value;
        }
        optional binary status (STRING);
      }
    }
  }
}{noformat}
The Avro schema is this:
{code:java}
{"type":"record","name":"list","namespace":"list3","fields":[
  {"name":"element","type":["null",
    {"type":"record","name":"element","namespace":"","fields":[
      {"name":"elements","type":["null",
        {"type":"array","items":
          {"type":"record","name":"list","namespace":"list4","fields":[
            {"name":"element","type":["null",
              {"type":"record","name":"element","namespace":"element2","fields":[
                {"name":"distance","type":["null",
                  {"type":"record","name":"distance","namespace":"","fields":[
                    {"name":"text","type":["null","string"],"default":null},
                    {"name":"value","type":["null","long"],"default":null}]}],"default":null},
                {"name":"duration","type":["null",
                  {"type":"record","name":"duration","namespace":"","fields":[
                    {"name":"text","type":["null","string"],"default":null},
                    {"name":"value","type":["null","long"],"default":null}]}],"default":null},
                {"name":"status","type":["null","string"],"default":null}]}],
              "default":null}]}}],
        "default":null}]}],
    "default":null}],
  "aliases":["array"]}{code}
There is indeed a field called "elements" in the Avro schema above. However, 
when I examine the avroSchema that getAvroField is actually searching, the 
Schema class's fieldMap only contains entries for "rows", 
"destination_address", "origin_address", and "status". It seems like these two 
schemas have very little to do with each other and somehow the wrong schema 
got passed down into this code.
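
For anyone repeating this check in a debugger, a small helper like the following makes the mismatch explicit at a given nesting level. This is only a sketch: SchemaNameDiff and missingFrom are hypothetical names, and the field names are copied by hand from what I saw above rather than read from real schema objects.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical debugging helper: compares field names at one level of two schemas. */
class SchemaNameDiff {

    /** Returns the names in 'wanted' that do not appear in 'available', in their original order. */
    static Set<String> missingFrom(Collection<String> wanted, Collection<String> available) {
        Set<String> missing = new LinkedHashSet<>(wanted);
        missing.removeAll(available);
        return missing;
    }

    public static void main(String[] args) {
        // Names copied by hand from the debugging session described above.
        List<String> parquetNames = List.of("elements");
        List<String> avroNames = List.of("rows", "destination_address", "origin_address", "status");

        System.out.println("In Parquet group but not in Avro record: "
                + missingFrom(parquetNames, avroNames));
        System.out.println("In Avro record but not in Parquet group: "
                + missingFrom(avroNames, parquetNames));
    }
}
```

If both directions come back non-empty with no overlap, as they would here, the two schemas are describing different levels of the record rather than differing in one renamed field.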

 


> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> The attached files are an original file and a modified file that results 
> from reading the original and writing it back with Parquet-MR, with a few 
> values changed. The schema should not change, since the schema of the input 
> file is used as the schema to write the output file. However, the output 
> file has a slightly modified schema that cannot be read back the same way 
> with Parquet-MR, resulting in the exception message: 
> java.lang.ClassCastException: optional binary element (STRING) is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
