[ https://issues.apache.org/jira/browse/PARQUET-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163040#comment-17163040 ]

Fokko Driesprong commented on PARQUET-1887:
-------------------------------------------

Thanks Øyvind for pointing this out. I've created a PR that addresses the 
issue: [https://github.com/apache/parquet-mr/pull/804/]

Please let me know if this solves it for you. I would also like to point out 
that changing the schema to make the array items nullable:

 

{code:json}
{
  "type": "record",
  "name": "person",
  "fields": [
    {
      "name": "address",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            "string"
          ]
        }
      ],
      "default": null
    }
  ]
}
{code}

should in theory also fix it. However, this isn't supported by Parquet right 
now. The easiest workaround is to filter the nulls out of the arrays before 
writing them, along the lines of the sketch below :)
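
A rough sketch of that workaround (illustration only; {{writer}} and {{schema}} 
refer to the objects set up in the reproduction code quoted below, and the 
stream-based filtering is just one way to do it):

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

import org.apache.avro.generic.GenericRecordBuilder;

// Drop null elements before building the record, since the schema
// declares the array items as non-nullable strings.
List<String> address = Arrays.asList("first", null, "last");
List<String> filtered = address.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.toList());

// "writer" and "schema" as set up in the reproduction code below.
writer.write(new GenericRecordBuilder(schema)
        .set("address", filtered)
        .build());
{code}

With the null elements removed, the write no longer throws mid-record, so the 
writer's internal state stays consistent for subsequent writes.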

 

> Exception thrown by AvroParquetWriter#write causes all subsequent calls to it 
> to fail
> -------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1887
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1887
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.11.0, 1.8.3
>            Reporter: Øyvind Strømmen
>            Priority: Major
>         Attachments: person1_11_0.parquet, person1_8_3.parquet
>
>
> Please see sample code below:
> {code:java}
> Schema schema = new Schema.Parser().parse("""
>         {
>           "type": "record",
>           "name": "person",
>           "fields": [
>             {
>               "name": "address",
>               "type": [
>                 "null",
>                 {
>                   "type": "array",
>                   "items": "string"
>                 }
>               ],
>               "default": null
>             }
>           ]
>         }
>         """
> );
> ParquetWriter<GenericRecord> writer = AvroParquetWriter
>         .<GenericRecord>builder(new org.apache.hadoop.fs.Path("/tmp/person.parquet"))
>         .withSchema(schema)
>         .build();
> try {
>     // To trigger the exception, write an array containing a null element.
>     writer.write(new GenericRecordBuilder(schema)
>             .set("address", Arrays.asList("first", null, "last"))
>             .build());
> } catch (Exception e) {
>     e.printStackTrace(); // "java.lang.NullPointerException: Array contains a null element at 1"
> }
> try {
>     // At this point all future calls to writer.write will fail.
>     writer.write(new GenericRecordBuilder(schema)
>             .set("address", Arrays.asList("foo", "bar"))
>             .build());
> } catch (Exception e) {
>     e.printStackTrace(); // "org.apache.parquet.io.InvalidRecordException: 1(r) > 0 ( schema r)"
> }
> writer.close();
> {code}
> It seems to me this is caused by state not being reset between writes. Is 
> this the intended behavior of the writer? And if so, does one have to create 
> a new writer whenever a write fails?
> I'm able to reproduce this using both parquet 1.8.3 and 1.11.0, and have 
> attached a sample parquet file for each version.



