[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

Timothy Miller (Jira) Thu, 14 Apr 2022 07:38:05 -0700


    [ https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522347#comment-17522347 ]


Timothy Miller commented on PARQUET-2069:
-----------------------------------------

This appears to occur due to the reader and writer schemas having a name 
mismatch on the record I mentioned earlier. The problem appears to be either in 
Avro or in how Avro is used.

When these structures are compared for compatibility, this method in 
org.apache.avro.SchemaCompatibility is used:

  public static boolean schemaNameEquals(final Schema reader, final Schema writer) {
    if (objectsEqual(reader.getName(), writer.getName())) {
      return true;
    }
    // Apply reader aliases:
    return reader.getAliases().contains(writer.getFullName());
  }

Evidently, when the reader schema was set up, no alias was added that would 
allow "list" and "array" to match here, so the name check returns false, and 
this causes the failure. The call chain that reaches this method, outermost 
caller first, is:

  org.apache.parquet.avro.AvroRecordConverter.isElementType
  -> checkReaderWriterCompatibility
  -> getCompatibility
  -> getCompatibility (another method of the same name)
  -> calculateCompatibility
  -> checkSchemaNames
  -> schemaNameEquals
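To make the name check concrete, here is a minimal, self-contained sketch of the same logic. It does not use Avro: NamedSchema is a hypothetical stand-in that holds only the three values the check reads (name, full name, aliases). Which side is named "list" and which "array", and the absence of namespaces, are assumptions made for illustration only.

```java
import java.util.Objects;
import java.util.Set;

public class NameMatchSketch {

    // Hypothetical stand-in for a named Avro schema; not part of Avro.
    static final class NamedSchema {
        final String name;
        final String fullName;
        final Set<String> aliases;

        NamedSchema(String name, String fullName, Set<String> aliases) {
            this.name = name;
            this.fullName = fullName;
            this.aliases = aliases;
        }
    }

    // Same decision as the quoted method: the names must be equal, or the
    // reader must list the writer's full name among its aliases.
    static boolean schemaNameEquals(NamedSchema reader, NamedSchema writer) {
        if (Objects.equals(reader.name, writer.name)) {
            return true;
        }
        return reader.aliases.contains(writer.fullName);
    }

    public static void main(String[] args) {
        // Assumed for illustration: the writer side is named "list",
        // the reader side "array", and neither has a namespace.
        NamedSchema writer = new NamedSchema("list", "list", Set.of());
        NamedSchema reader = new NamedSchema("array", "array", Set.of());
        NamedSchema readerWithAlias =
                new NamedSchema("array", "array", Set.of("list"));

        System.out.println(schemaNameEquals(reader, writer));          // false
        System.out.println(schemaNameEquals(readerWithAlias, writer)); // true
    }
}
```

In this sketch the first comparison fails because the names differ and the reader has no aliases; the second succeeds only because the reader declares "list" as an alias. Whether adding such an alias is the right fix in parquet-avro itself is a separate question that this sketch does not settle.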

 

> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> The attached files include one original file and one modified file. The 
> modified file was produced by reading the original file and writing it back 
> with Parquet-MR, with a few values changed. The schema should not change, 
> since the schema of the input file is used as the schema to write the output 
> file. However, the output file has a slightly modified schema that 
> cannot be read back the same way with Parquet-MR, resulting in the 
> exception message: java.lang.ClassCastException: optional binary element 
> (STRING) is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
