
[jira] [Commented] (PARQUET-1976) Use net.alchim31.maven:scala-maven-plugin instead of org.scala-tools:maven-scala-plugin

2021-02-09 Thread Michael Heuer (Jira)



[ https://issues.apache.org/jira/browse/PARQUET-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281920#comment-17281920 ]

Michael Heuer commented on PARQUET-1976:


Re: Scala 2.12.12, note the comment at

https://issues.apache.org/jira/browse/SPARK-33921?focusedCommentId=17255394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17255394

It might be better to stop at Scala 2.12.10, as Spark 3.1.x does, or jump ahead 
to Scala 2.12.13.

> Use net.alchim31.maven:scala-maven-plugin instead of 
> org.scala-tools:maven-scala-plugin
> ---
>
> Key: PARQUET-1976
> URL: https://issues.apache.org/jira/browse/PARQUET-1976
> Project: Parquet
> Issue Type: Improvement
> Reporter: Martin Tzvetanov Grigorov
> Priority: Minor
>
> org.scala-tools:maven-scala-plugin has not been maintained for a long time.
> [net.alchim31.maven:scala-maven-plugin|https://github.com/davidB/scala-maven-plugin]
>  is the replacement.
> Also, the Scala version could be upgraded from 2.12.8 to 2.12.12.
> A few other Maven plugins could also be upgraded.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1894) Please fix the related Shaded Jackson Databind CVEs

2020-08-01 Thread Michael Heuer (Jira)



[ https://issues.apache.org/jira/browse/PARQUET-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169321#comment-17169321 ]

Michael Heuer commented on PARQUET-1894:


I would love to hear otherwise, but I believe Spark is blocked from upgrading 
Parquet due to the incompatible transitive Avro upgrade.

> Please fix the related Shaded Jackson Databind CVEs
> ---
>
> Key: PARQUET-1894
> URL: https://issues.apache.org/jira/browse/PARQUET-1894
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Rodney Aaron Stainback
> Priority: Major
>
> The following CVEs are all related to version 2.9.10 of Jackson databind 
> which you shade
> |cve|severity|cvss|
> |CVE-2019-16942|critical|9.8|
> |CVE-2019-16943|critical|9.8|
> |CVE-2019-17531|critical|9.8|
> |CVE-2019-20330|critical|9.8|
> |CVE-2020-10672|high|8.8|
> |CVE-2020-10673|high|8.8|
> |CVE-2020-10968|high|8.8|
> |CVE-2020-10969|high|8.8|
> |CVE-2020-1|high|8.8|
> |CVE-2020-2|high|8.8|
> |CVE-2020-3|high|8.8|
> |CVE-2020-11619|critical|9.8|
> |CVE-2020-11620|critical|9.8|
> |CVE-2020-14060|high|8.1|
> |CVE-2020-14061|high|8.1|
> |CVE-2020-14062|high|8.1|
> |CVE-2020-14195|high|8.1|
> |CVE-2020-8840|critical|9.8|
> |CVE-2020-9546|critical|9.8|
> |CVE-2020-9547|critical|9.8|
> |CVE-2020-9548|critical|9.8|
>  
> Our security team is trying to block us from using parquet files because of 
> this issue




[jira] [Commented] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose

2020-01-12 Thread Michael Heuer (Jira)



[ https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013869#comment-17013869 ]

Michael Heuer commented on PARQUET-1758:


+1, excessive logging from Parquet has been a pain for us downstream for many 
years.

> InternalParquetRecordReader Logging it Too Verbose
> --
>
> Key: PARQUET-1758
> URL: https://issues.apache.org/jira/browse/PARQUET-1758
> Project: Parquet
> Issue Type: Improvement
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Minor
> Labels: pull-request-available
>
> A low-level library like Parquet should be pretty quiet.  It should just do 
> its work and keep quiet.  Most issues should be addressed by throwing 
> Exceptions and the occasional warning message; otherwise it will clutter the 
> logging for the top-level application.  If debugging is required, an 
> administrator can enable it for the specific workload.
> *Warning:* This is my opinion. No stats to back it up.




[jira] [Commented] (PARQUET-1645) Bump Apache Avro to 1.9.1

2019-11-07 Thread Michael Heuer (Jira)



[ https://issues.apache.org/jira/browse/PARQUET-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969351#comment-16969351 ]

Michael Heuer commented on PARQUET-1645:


I am very curious about this – Parquet vs Avro version incompatibilities have 
been a major source of headaches for us downstream of Apache Spark.  Will Spark 
be able to accept Avro 1.9.1 and Parquet 1.11.0 upgrades simultaneously?

> Bump Apache Avro to 1.9.1
> -
>
> Key: PARQUET-1645
> URL: https://issues.apache.org/jira/browse/PARQUET-1645
> Project: Parquet
> Issue Type: Task
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.11.0
>
>





[jira] [Commented] (PARQUET-1241) [C++] Use LZ4 frame format

2019-11-03 Thread Michael Heuer (Jira)



[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965675#comment-16965675 ]

Michael Heuer commented on PARQUET-1241:


For JVM implementations, note that Apache Commons Compress supports both 
block and frame compression:

[https://github.com/apache/commons-compress/tree/master/src/main/java/org/apache/commons/compress/compressors/lz4]

It appears that it can detect frame LZ4 from an input stream, but not block LZ4:

[https://github.com/apache/commons-compress/blob/master/src/main/java/org/apache/commons/compress/compressors/CompressorStreamFactory.java#L466]
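As a rough illustration of why framed data can be sniffed from a stream while block data cannot: the LZ4 frame format begins with the magic number 0x184D2204 (stored little-endian), whereas a raw LZ4 block carries no header at all. The sketch below is plain Java with no Commons Compress dependency; the class and method names (Lz4FrameSniffer, looksLikeLz4Frame) are hypothetical and are not part of any library.

```java
public class Lz4FrameSniffer {
    // LZ4 frame magic number 0x184D2204, as it appears on the wire (little-endian).
    private static final byte[] FRAME_MAGIC = {0x04, 0x22, 0x4D, 0x18};

    /** Returns true if the leading bytes match the LZ4 frame magic number. */
    static boolean looksLikeLz4Frame(byte[] head) {
        if (head.length < FRAME_MAGIC.length) {
            return false; // too short to hold the magic number
        }
        for (int i = 0; i < FRAME_MAGIC.length; i++) {
            if (head[i] != FRAME_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a frame header prefix and some arbitrary bytes
        // standing in for a raw block, which has no signature to check.
        byte[] framed = {0x04, 0x22, 0x4D, 0x18, 0x40};
        byte[] rawBlock = {0x10, 0x61, 0x62};
        System.out.println(looksLikeLz4Frame(framed));   // prints "true"
        System.out.println(looksLikeLz4Frame(rawBlock)); // prints "false"
    }
}
```

A false result only means "not a frame"; it cannot confirm that the bytes are a valid raw block, which is presumably why a reader has to be told about the block format out of band.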

> [C++] Use LZ4 frame format
> --
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp, parquet-format
> Reporter: Lawrence Chan
> Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314




[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Michael Heuer (JIRA)



[ https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700834#comment-16700834 ]

Michael Heuer commented on PARQUET-1441:


Sorry, which compatibility check and commit?  I'm also confused by the version 
numbers in your comment; both Parquet and Avro have made 1.8.2 releases.


The regression is complicated and perhaps not worth discussing here. With Spark 
moving to Parquet 1.10 and Avro 1.8.2, our [previous workaround of pinning 
parquet-avro to 
1.8.1|https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520] no 
longer works.  That workaround was necessary because Spark depended on Parquet 
1.8.2 and Avro 1.7.x, which were incompatible with each other.

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Reporter: Michael Heuer
> Priority: Major
> Labels: pull-request-available
>

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-26 Thread Michael Heuer (JIRA)



[ https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699567#comment-16699567 ]

Michael Heuer commented on PARQUET-1441:


Note, as mentioned above, that while 
{{parquet.avro.add-list-element-records=false}} works in the unit tests, it 
does not appear to work with AvroIndexedRecordConverter, which is what we hit 
downstream in Spark.

As for workarounds, I'm afraid we're so far downstream that I'm not sure we 
would be able to use one.  We use Avro AVDL to generate Java objects for 
persisting Spark RDDs to Parquet and separately to generate Scala products for 
persisting Spark Datasets to Parquet.  Spark generates the schema for these 
Datasets-as-Parquet.  Up until Spark version 2.4.0, which bumped Parquet to 
version 1.10 and Avro to 1.8.2, we could write out Datasets-as-Parquet and read 
in RDDs-as-Parquet without trouble (the two different schemas were considered 
compatible).

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Reporter: Michael Heuer
> Priority: Major
> Labels: pull-request-available
>

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-10-10 Thread Michael Heuer (JIRA)



[ https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645480#comment-16645480 ]

Michael Heuer commented on PARQUET-1441:


I've found I can get a similar stack trace going through AvroRecordConverter 
instead of AvroIndexedRecordConverter, by setting parquet.avro.compatible to 
false
{code:scala}
val job = HadoopUtil.newJob(sc)
val conf = ContextUtil.getConfiguration(job)
conf.setBoolean("parquet.avro.compatible", false)
{code}

{noformat}
  Cause: org.apache.avro.SchemaParseException: Can't redefine: list
  at org.apache.avro.Schema$Names.put(Schema.java:1128)
  at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema.toString(Schema.java:324)
  at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
  at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
  at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:475)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
  at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
...
{noformat}

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Reporter: Michael Heuer
> Priority: Major
>

[jira] [Created] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-10-09 Thread Michael Heuer (JIRA)

Michael Heuer created PARQUET-1441:
--

Summary: SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
Key: PARQUET-1441
URL: https://issues.apache.org/jira/browse/PARQUET-1441
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Reporter: Michael Heuer


The following unit test added to TestAvroSchemaConverter fails
{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";

  Configuration conf = new Configuration(false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

while this one succeeds
{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";

  Configuration conf = new Configuration(false);
  conf.setBoolean("parquet.avro.add-list-element-records", false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

I don't see a way to influence the code path in AvroIndexedRecordConverter to 
respect this configuration, resulting in the following stack trace downstream
{noformat}
  Cause: org.apache.avro.SchemaParseException: Can't redefine: list
  at org.apache.avro.Schema$Names.put(Schema.java:1128)
  at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema.toString(Schema.java:324)
  at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
  at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
  at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
  at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
  at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
...
{noformat}

See also downstream issues
https://issues.apache.org/jira/browse/SPARK-25588
[https://github.com/bigdatagenomics/adam/issues/2058]
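
One way to read the stack trace: the Parquet schema contains two distinct nested groups named "list", and when the converted Avro schema is printed, the second record with that name appears to collide with the first. The following is a toy model of that name bookkeeping in plain Java, not Avro's actual code; Rec, register and nestedLists are hypothetical names, and real Avro additionally allows an identical, already-defined schema to be referenced again by name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RedefineSketch {
    /** Hypothetical stand-in for a named record schema; not an Avro class. */
    static final class Rec {
        final String name;
        final List<Rec> children = new ArrayList<>();

        Rec(String name) {
            this.name = name;
        }

        Rec add(Rec child) {
            children.add(child);
            return this;
        }
    }

    /** Registers each record name once; a second definition of a name fails. */
    static void register(Rec rec, Set<String> seen) {
        if (!seen.add(rec.name)) {
            throw new IllegalStateException("Can't redefine: " + rec.name);
        }
        for (Rec child : rec.children) {
            register(child, seen);
        }
    }

    /** Mirrors the nesting in the failing test: list -> element -> list. */
    static Rec nestedLists() {
        Rec innerList = new Rec("list");
        Rec element = new Rec("element").add(innerList);
        Rec outerList = new Rec("list").add(element);
        return new Rec("spark_schema").add(new Rec("annotation").add(outerList));
    }

    public static void main(String[] args) {
        try {
            register(nestedLists(), new HashSet<>());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Can't redefine: list"
        }
    }
}
```

In this model, giving the two "list" records distinct names (or distinct namespaces) avoids the collision.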



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
