I was able to get something working locally. I'll open a JIRA and have a PR 
once I have sufficient tests in place.

// ah

From: Hailu, Andreas [Engineering]
Sent: Friday, May 14, 2021 12:09 PM
To: [email protected]
Subject: AvroParquetOutputFormat - Unable to Write Arrays with Null Elements

Hi folks, I'm using v1.11.1 of the parquet-mr library as part of a Java 
application that takes Avro records and writes them into Parquet files using 
the AvroParquetOutputFormat. Some of the Avro records have array-type fields 
that contain null elements, e.g. ["Foo", "Bar", null, "Baz"]. Here's an example 
Avro schema:

{
  "type": "record",
  "name": "NullLists",
  "namespace": "com.test",
  "fields": [
    {
      "name": "KeyID",
      "type": "string"
    },
    {
      "name": "NullableList",
      "type": [
        "null",
        {
            "type": "array",
            "items": [
                "null",
                "string"
            ]
        }
      ],
      "default": null
    }
  ]
}

I'm trying to write the following record:

{
  "KeyID": "0",
  "NullableList": [
    "foo",
    null,
    "baz"
  ]
}

I thought I could use the 3-level list writer to support this. However, it 
results in the following exception:

Caused by: java.lang.ClassCastException: repeated binary array (STRING) is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:250)
        at org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:612)
        at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:397)
        at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
        at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
        at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
        at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)

Is this kind of record supported? I have also tried setting the 
"parquet.avro.add-list-element-records" option to false, with no luck.
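
For reference, here is a minimal sketch of the two settings involved, using a plain java.util.Map as a stand-in for the Hadoop Configuration so that it runs without parquet-mr on the classpath. The class name ListWriterSettings and its method are hypothetical. The key "parquet.avro.write-old-list-structure" is assumed here to be the one that selects the 3-level list writer when set to false; the second key is the one quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the parquet-avro list settings in one place.
// The Map is only a stand-in for a Hadoop Configuration.
class ListWriterSettings {
    // Assumed key for choosing the list layout; "false" requests 3-level lists.
    static final String WRITE_OLD_LIST_STRUCTURE = "parquet.avro.write-old-list-structure";
    // Key quoted in the message above.
    static final String ADD_LIST_ELEMENT_RECORDS = "parquet.avro.add-list-element-records";

    static Map<String, String> threeLevelLists() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(WRITE_OLD_LIST_STRUCTURE, "false");
        settings.put(ADD_LIST_ELEMENT_RECORDS, "false");
        return settings;
    }

    public static void main(String[] args) {
        // In a real job each entry would be applied to the job's Configuration
        // with conf.set(key, value) before the output format is initialised.
        threeLevelLists().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Note that the trace reports the repeated field as a primitive ("repeated binary array") where ThreeLevelListWriter expects a group, which suggests that the writer and the converted Parquet schema may have disagreed on the list layout.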

____________

Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.

