Claire McGinty created PARQUET-2265:
---------------------------------------

             Summary: AvroParquetWriter should default to data supplier model from Configuration
                 Key: PARQUET-2265
                 URL: https://issues.apache.org/jira/browse/PARQUET-2265
             Project: Parquet
          Issue Type: Improvement
            Reporter: Claire McGinty


I recently ran into a bug where the AvroDataSupplier I specified in my 
Configuration wasn't respected when creating an AvroParquetWriter:

 

```

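import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroDataSupplier;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;

// Configuration stores the supplier's class, not the instance.
Configuration configuration = new Configuration();
configuration.setClass(
    AvroWriteSupport.AVRO_DATA_SUPPLIER,
    myCustomDataSupplier.getClass(),
    AvroDataSupplier.class);

// build() returns a ParquetWriter, not an AvroParquetWriter.
ParquetWriter<MyAvroRecord> writer =
  AvroParquetWriter.<MyAvroRecord>builder(...)
    .withSchema(...)
    .withConf(configuration)
    .build();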

```

In this case, the writer's AvroWriteSupport uses the SpecificData model 
rather than the value of `myCustomDataSupplier.get()`. This is because 
AvroParquetWriter.Builder defaults its model to SpecificData[0] unless one is 
set explicitly on the builder, so the supplier in the Configuration is never 
consulted.

I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO 
this creates confusion/redundancy, since I end up supplying the data model 
twice. It also means I can't build abstractions around this (e.g. a 
`createWriterForConfiguration(Configuration conf)` type of method) without 
using reflection to instantiate the supplier class stored under 
`AvroWriteSupport.AVRO_DATA_SUPPLIER` and get a data model from it.



I think it would be simplest if AvroParquetWriter.Builder just defaulted to 
`model = null` and let AvroWriteSupport initialize it based on the 
Configuration[1]. What do you think?
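
To make the proposed precedence concrete, here is a minimal, self-contained 
sketch. It does not use the real Parquet or Hadoop classes: `Model`, 
`WriteSupport`, `Builder`, both supplier classes, and the key 
`sketch.data.supplier` are hypothetical stand-ins, and a plain `Map` plays the 
role of the Configuration. It only illustrates the idea of leaving the 
builder's model unset so that the write support can fall back to the 
configured supplier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stand-ins only: none of these types are the real Parquet or Hadoop classes.
final class DataModelDefaultSketch {
    // Hypothetical key, playing the role of AvroWriteSupport.AVRO_DATA_SUPPLIER.
    static final String DATA_SUPPLIER_KEY = "sketch.data.supplier";

    // Stand-in for a data model such as SpecificData.
    static final class Model {
        final String name;
        Model(String name) { this.name = name; }
    }

    // Stand-in for the user's custom supplier, registered by class name.
    public static final class CustomSupplier implements Supplier<Model> {
        @Override public Model get() { return new Model("custom"); }
    }

    // Stand-in for the supplier used when the configuration names none.
    public static final class DefaultSupplier implements Supplier<Model> {
        @Override public Model get() { return new Model("specific-default"); }
    }

    // Stand-in write support: consults the configuration only if no model was passed.
    static final class WriteSupport {
        final Model model;

        WriteSupport(Model explicitModel, Map<String, String> conf) {
            this.model = explicitModel != null ? explicitModel : modelFromConf(conf);
        }

        private static Model modelFromConf(Map<String, String> conf) {
            String className =
                conf.getOrDefault(DATA_SUPPLIER_KEY, DefaultSupplier.class.getName());
            try {
                // Instantiate the configured supplier reflectively and ask it for a model.
                Object supplier = Class.forName(className).getDeclaredConstructor().newInstance();
                return (Model) ((Supplier<?>) supplier).get();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + className, e);
            }
        }
    }

    // Stand-in builder: the proposed change is that `model` starts out null.
    static final class Builder {
        private Model model = null;
        private Map<String, String> conf = new HashMap<>();

        Builder withDataModel(Model m) { this.model = m; return this; }
        Builder withConf(Map<String, String> c) { this.conf = c; return this; }
        WriteSupport build() { return new WriteSupport(model, conf); }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DATA_SUPPLIER_KEY, CustomSupplier.class.getName());
        // No withDataModel call, so the configured supplier decides the model.
        System.out.println(new Builder().withConf(conf).build().model.name);
    }
}
```

In this sketch an explicit `withDataModel` call still wins over the 
configuration, so callers who set the model on the builder today would see no 
change; only callers who rely on the configuration alone would get the 
configured supplier's model instead of the hard-coded default.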

 

[0] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163

[1] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
