[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

ASF GitHub Bot (Jira) Mon, 20 Jun 2022 07:30:06 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556405#comment-17556405
 ]


ASF GitHub Bot commented on PARQUET-2069:
-----------------------------------------

theosib-amazon commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r901733632


##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##########
@@ -136,10 +137,22 @@ public RecordMaterializer<T> prepareForRead(
 
     GenericData model = getDataModel(configuration);
     String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-    if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-      return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+    try {
+      if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+        return newCompatMaterializer(parquetSchema, avroSchema, model);
+      }
+      return new AvroRecordMaterializer<T>(parquetSchema, avroSchema, model);
+    } catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I think the underlying problem is that some versions of ParquetMR produce 
*bad schemas*, so when we try to load those same files, parsing fails, since 
the Parquet schema implicit in the file metadata doesn't match up with the 
stored Avro schema. I'm not sure what to do about bad schemas other than to 
throw them away and try a fallback.





> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> In the attached files, there is one original file, and one written modified 
> file that results after reading the original file and writing it back with 
> Parquet-MR, with a few values modified. The schema should not be modified, 
> since the schema of the input file is used as the schema to write the output 
> file. However, the output file has a slightly modified schema that then 
> cannot be read back the same way again with Parquet-MR, resulting in the 
> exception message:  java.lang.ClassCastException: optional binary element 
> (STRING) is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

Reply via email to