[jira] [Comment Edited] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

Timothy Miller (Jira) Thu, 07 Apr 2022 08:54:06 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518968#comment-17518968
 ]


Timothy Miller edited comment on PARQUET-2069 at 4/7/22 3:53 PM:
-----------------------------------------------------------------

An initial look at this suggests that the writer is corrupting the data 
somehow. Comparing the "parquet-tools dump" output of the two files, there are 
a bunch of (I guess) harmless changes from "list" to "array", but there are 
also two examples of data being corrupted. See the screenshot.

There, you can see that the string "Washington, DC, USA" is coming out of the 
modified file as "oX/xgBpN04At7pWhlc6/7kX0nIdGGp4EJCY2guUZ/vY=". It's probably 
worse than that, but the command line tool is doing its best to make sense of a 
corrupted file.
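
One quick check on the replacement value is whether it is structured data 
rather than random bytes. The sketch below only tests whether each string 
is valid Base64 and, if so, how many bytes it decodes to. The class and 
method names are made up for illustration, and the result says nothing 
about how the value was produced.

```java
import java.util.Base64;

// Hypothetical helper, not part of Parquet-MR or parquet-tools.
class Base64Check {
    // Returns the decoded byte count, or -1 if the text is not valid Base64.
    static int decodedLength(String text) {
        try {
            return Base64.getDecoder().decode(text).length;
        } catch (IllegalArgumentException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Both strings are copied from the comment above.
        String original = "Washington, DC, USA";
        String modified = "oX/xgBpN04At7pWhlc6/7kX0nIdGGp4EJCY2guUZ/vY=";
        System.out.println("original: " + decodedLength(original)); // -1
        System.out.println("modified: " + decodedLength(modified)); // 32
    }
}
```

The modified value decodes cleanly to exactly 32 bytes, while the original 
is plain text. A fixed-size binary value like that may be worth checking 
against the "few values modified" mentioned in the issue description before 
concluding that the writer corrupted it.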


was (Author: JIRAUSER287471):
An  initial look at this suggests that the writer is corrupting the data 
somehow. I'm using "parquet-tools dump", and there are a bunch of (I guess) 
harmless changes from "list" to "array", but there's also two examples of data 
being corrupted. See the screenshot.

> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> The attached files are an original file and a modified file. The modified 
> file was produced by reading the original with Parquet-MR and writing it 
> back with a few values changed. The schema should not change, since the 
> schema of the input file is used as the schema to write the output file. 
> However, the output file has a slightly different schema and cannot be read 
> back the same way with Parquet-MR. Reading it fails with the exception 
> message:  java.lang.ClassCastException: optional binary element (STRING) 
> is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The attached Parquet files contain some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
