[ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536894#comment-17536894
 ] 

Timothy Miller commented on PARQUET-2069:
-----------------------------------------

So, where does the Avro schema come from in the first place? It is chosen in 
org.apache.parquet.avro.AvroReadSupport<T>.prepareForRead().

This if test passes:

 
{code:java}
if (keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY) != null){code}
 

Using a parquet metadata analysis tool I wrote, we can see where this is coming 
from:

 
{code:java}
    5: List(KeyValue key_value_metadata):
        1: Struct(KeyValue key_value_metadata):
           1: string(key) = Binary("parquet.avro.schema")
           2: string(value) = 
Binary("{"type":"record","name":"spark_schema","fields":[{"name":"destination_addresses","type":["null",{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"element","type":["null","string"],"default":null}]}}],"default":null},{"name":"origin_addresses","type":["null",{"type":"array","items":{"type":"record","name":"list","namespace":"list2","fields":[{"name":"element","type":["null","string"],"default":null}]}}],"default":null},{"name":"rows","type":["null",{"type":"array","items":{"type":"record","name":"list","namespace":"list3","fields":[{"name":"element","type":["null",{"type":"record","name":"element","namespace":"","fields":[{"name":"elements","type":["null",{"type":"array","items":{"type":"record","name":"list","namespace":"list4","fields":[{"name":"element","type":["null",{"type":"record","name":"element","namespace":"element2","fields":[{"name":"distance","type":["null",{"type":"record","name":"distance","namespace":"","fields":[{"name":"text","type":["null","string"],"default":null},{"name":"value","type":["null","long"],"default":null}]}],"default":null},{"name":"duration","type":["null",{"type":"record","name":"duration","namespace":"","fields":[{"name":"text","type":["null","string"],"default":null},{"name":"value","type":["null","long"],"default":null}]}],"default":null},{"name":"status","type":["null","string"],"default":null}]}],"default":null}]}}],"default":null}]}],"default":null}]}}],"default":null},{"name":"status","type":["null","string"],"default":null}]}"){code}
The Parquet schema comes from the Thrift data structures at the end of the 
Parquet file. The decoded output is so lengthy that I put it on pastebin: 
https://pastebin.com/cRGPkSwH


I'll try to put these side by side later and see if I can make sense of them. 
My guess is that the writer made a mistake and produced two incompatible 
schemas.

If I hack prepareForRead() to ignore the Avro schema, everything parses just 
fine.
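
The check quoted earlier and the hack described here can be pictured with the 
stand-alone sketch below. It is not the parquet-avro code: SchemaSourceSketch, 
pickSchemaSource and the ignoreEmbedded flag are hypothetical names, the only 
value taken from this comment is the "parquet.avro.schema" key, and the 
fallback of deriving the Avro schema from the Parquet schema is an assumption 
about what the reader does once the embedded schema is skipped.

```java
import java.util.HashMap;
import java.util.Map;

class SchemaSourceSketch {
    // Metadata key shown in the key_value_metadata dump above.
    public static final String AVRO_SCHEMA_METADATA_KEY = "parquet.avro.schema";

    /**
     * Hypothetical helper: reports which schema a reader would end up using.
     * "embedded" = the Avro schema stored in the file's key/value metadata.
     * "derived"  = a schema derived from the Parquet schema (assumed fallback).
     */
    public static String pickSchemaSource(Map<String, String> keyValueMetaData,
                                          boolean ignoreEmbedded) {
        // Same shape as the quoted test, plus a flag that models the hack.
        if (!ignoreEmbedded && keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY) != null) {
            return "embedded";
        }
        return "derived";
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // Placeholder value; the real file carries the long schema shown above.
        metadata.put(AVRO_SCHEMA_METADATA_KEY,
                "{\"type\":\"record\",\"name\":\"spark_schema\",\"fields\":[]}");

        System.out.println(pickSchemaSource(metadata, false)); // unmodified reader
        System.out.println(pickSchemaSource(metadata, true));  // hacked reader
    }
}
```

With the key present, the unmodified path reports "embedded" and the hacked 
path reports "derived", which matches the observation that the file parses 
once the embedded Avro schema is taken out of the picture.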

Right now I'm rebuilding everything with the fix from the PR backed out. I'll 
post another comment to let you know how that goes.

 

 

> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> In the attached files, there is one original file, and one written modified 
> file that results after reading the original file and writing it back with 
> Parquet-MR, with a few values modified. The schema should not be modified, 
> since the schema of the input file is used as the schema to write the output 
> file. However, the output file has a slightly modified schema that then 
> cannot be read back the same way again with Parquet-MR, resulting in the 
> exception message:  java.lang.ClassCastException: optional binary element 
> (STRING) is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
