[
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699034#comment-16699034
]
Thiruvalluvan M. G. commented on PARQUET-1441:
----------------------------------------------
Great point [~nkollar]! Here is something we can do without breaking the
rules of either system. Parquet expects the repeated group to be named
{{list}}, and Avro does not accept more than one schema named {{list}}. Could
we modify {{AvroSchemaConverter}} to append a unique suffix (say, a number) to
the name of each generated Avro record schema? Essentially, we'd do
automatically what I earlier proposed doing manually. So as not to surprise
those who use only one {{list}}, the first generated record would still be named
{{list}}; every subsequent one would get an increasing number, starting with
{{1}}. Would that work?
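To make the naming rule concrete, here is a minimal sketch of it in isolation. {{UniqueNameAllocator}} and {{allocate}} are hypothetical names used only for illustration; they are not part of parquet-avro, and this does not show how {{AvroSchemaConverter}} would actually be changed. It only assumes the converter would ask such a helper for a name each time it generates a record for a repeated group.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of parquet-avro): hands out unique record names. */
final class UniqueNameAllocator {
  // Number of times each base name has been requested so far.
  private final Map<String, Integer> seen = new HashMap<>();

  /** Returns the base name on first use, then base1, base2, and so on. */
  String allocate(String base) {
    // merge() returns the updated count, so subtract 1 to get the prior uses.
    int priorUses = seen.merge(base, 1, Integer::sum) - 1;
    return priorUses == 0 ? base : base + priorUses;
  }
}
```

With one allocator per converted schema, successive calls to {{allocate("list")}} would give {{list}}, {{list1}}, {{list2}}. Note that this sketch does not guard against a suffixed name colliding with a name already present in the user's schema; a real change would need to handle that case.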
> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> ------------------------------------------------------------------------
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Reporter: Michael Heuer
> Priority: Major
> Labels: pull-request-available
>
> The following unit test, added to TestAvroSchemaConverter, fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       " optional group annotation {\n" +
>       " optional group transcriptEffects (LIST) {\n" +
>       " repeated group list {\n" +
>       " optional group element {\n" +
>       " optional group effects (LIST) {\n" +
>       " repeated group list {\n" +
>       " optional binary element (UTF8);\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>       " optional group annotation {\n" +
>       " optional group transcriptEffects (LIST) {\n" +
>       " repeated group list {\n" +
>       " optional group element {\n" +
>       " optional group effects (LIST) {\n" +
>       " repeated group list {\n" +
>       " optional binary element (UTF8);\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       " }\n" +
>       "}\n";
>
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
> Cause: org.apache.avro.SchemaParseException: Can't redefine: list
> at org.apache.avro.Schema$Names.put(Schema.java:1128)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
> at org.apache.avro.Schema.toString(Schema.java:324)
> at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
> at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
> at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
> at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
> at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
> at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
> at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
> at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
> at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
> at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
> at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
> at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
> at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> ...
> {noformat}
> See also downstream issues:
> https://issues.apache.org/jira/browse/SPARK-25588
> https://github.com/bigdatagenomics/adam/issues/2058