[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

Thiruvalluvan M. G. (JIRA) Sun, 25 Nov 2018 05:25:58 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698168#comment-16698168
 ]


Thiruvalluvan M. G. commented on PARQUET-1441:
----------------------------------------------

The behavior comes from how record names are assigned. For every {{group}} in 
the Parquet schema, {{AvroSchemaConverter}} constructs an Avro record schema. 
An Avro record schema has a required {{name}} attribute, and 
{{AvroSchemaConverter}} uses the name of the {{group}} as that {{name}}. In 
this example, there are two groups with the same name, {{list}}. Avro requires 
schema names to be unique, so the two records named {{list}} cause Avro to 
complain. A simple fix is to remove the conflict, for example by renaming the 
second {{list}} to {{list2}}:
{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list2 {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";

  Configuration conf = new Configuration(false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema =
      avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}
I've verified that the test passes with this rename.
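
To illustrate the naming conflict described above, here is a simplified model in plain Java. It is a sketch, not code from parquet-avro or Avro: the {{GroupNameCheck}} class, its {{Group}} type, and the {{duplicateNames}} helper are hypothetical names introduced only for this example. It assumes every group becomes a record named after the group and ignores primitives and namespaces. It walks a tree of groups and reports any name that would be defined more than once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model (assumption): each group maps to one record named after the
// group, and record names must be unique. Not parquet-avro or Avro code.
final class GroupNameCheck {

  /** A Parquet-like group: a name plus nested child groups (primitives omitted). */
  static final class Group {
    final String name;
    final List<Group> children = new ArrayList<>();

    Group(String name, Group... children) {
      this.name = name;
      for (Group child : children) {
        this.children.add(child);
      }
    }
  }

  /** Returns each group name that occurs more than once anywhere in the tree. */
  static List<String> duplicateNames(Group root) {
    Set<String> seen = new HashSet<>();
    List<String> duplicates = new ArrayList<>();
    collect(root, seen, duplicates);
    return duplicates;
  }

  // Depth-first walk; a name already in 'seen' is recorded once as a duplicate.
  private static void collect(Group group, Set<String> seen, List<String> duplicates) {
    if (!seen.add(group.name) && !duplicates.contains(group.name)) {
      duplicates.add(group.name);
    }
    for (Group child : group.children) {
      collect(child, seen, duplicates);
    }
  }

  public static void main(String[] args) {
    // Group structure of the failing schema from the issue (groups only).
    Group failing = new Group("spark_schema",
        new Group("annotation",
            new Group("transcriptEffects",
                new Group("list",
                    new Group("element",
                        new Group("effects",
                            new Group("list")))))));
    System.out.println(duplicateNames(failing)); // prints [list]
  }
}
```

With the inner group renamed to {{list2}}, the same walk reports no duplicates, which matches the workaround above.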

[~heuermh] If this workaround solves your problem, please resolve this issue 
and the corresponding Avro issue, AVRO-2272. If you cannot apply this 
workaround, please let us know why. Thank you.

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-1441
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1441
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>            Reporter: Michael Heuer
>            Priority: Major
>              Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       "  optional group annotation {\n" +
>       "    optional group transcriptEffects (LIST) {\n" +
>       "      repeated group list {\n" +
>       "        optional group element {\n" +
>       "          optional group effects (LIST) {\n" +
>       "            repeated group list {\n" +
>       "              optional binary element (UTF8);\n" +
>       "            }\n" +
>       "          }\n" +
>       "        }\n" +
>       "      }\n" +
>       "    }\n" +
>       "  }\n" +
>       "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema =
>       avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       "  optional group annotation {\n" +
>       "    optional group transcriptEffects (LIST) {\n" +
>       "      repeated group list {\n" +
>       "        optional group element {\n" +
>       "          optional group effects (LIST) {\n" +
>       "            repeated group list {\n" +
>       "              optional binary element (UTF8);\n" +
>       "            }\n" +
>       "          }\n" +
>       "        }\n" +
>       "      }\n" +
>       "    }\n" +
>       "  }\n" +
>       "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema =
>       avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
>   at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
>   at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
>   at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
>   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> ...
> {noformat}
> See also downstream issues
> https://issues.apache.org/jira/browse/SPARK-25588
> [https://github.com/bigdatagenomics/adam/issues/2058]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
