[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

ASF GitHub Bot (JIRA) Thu, 29 Nov 2018 07:19:43 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703321#comment-16703321
 ]


ASF GitHub Bot commented on PARQUET-1441:
-----------------------------------------

nandorKollar opened a new pull request #560: PARQUET-1441: 
SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
URL: https://github.com/apache/parquet-mr/pull/560
 
 
   Parquet Avro reader can't convert Parquet such a schema where a group field 
name is reused
   in an inner structure. The converter creates Avro record schma in this case,
   but in Avro record types should have a unique name, therefore the result is 
an invalid Avro
   schema. This patch fixes this case by adding a namespace for the record if 
the name was
   defined before, this way making the record names unique.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-1441
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1441
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>            Reporter: Michael Heuer
>            Priority: Major
>              Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       "  optional group annotation {\n" +
>       "    optional group transcriptEffects (LIST) {\n" +
>       "      repeated group list {\n" +
>       "        optional group element {\n" +
>       "          optional group effects (LIST) {\n" +
>       "            repeated group list {\n" +
>       "              optional binary element (UTF8);\n" +
>       "            }\n" +
>       "          }\n" +
>       "        }\n" +
>       "      }\n" +
>       "    }\n" +
>       "  }\n" +
>       "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       "  optional group annotation {\n" +
>       "    optional group transcriptEffects (LIST) {\n" +
>       "      repeated group list {\n" +
>       "        optional group element {\n" +
>       "          optional group effects (LIST) {\n" +
>       "            repeated group list {\n" +
>       "              optional binary element (UTF8);\n" +
>       "            }\n" +
>       "          }\n" +
>       "        }\n" +
>       "      }\n" +
>       "    }\n" +
>       "  }\n" +
>       "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
>   at 
> org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
>   at 
> org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
>   at 
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> ...
> {noformat}
> See also downstream issues
> https://issues.apache.org/jira/browse/SPARK-25588
> [https://github.com/bigdatagenomics/adam/issues/2058]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

Reply via email to