[ https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gang Wu reassigned PARQUET-2265:
--------------------------------

    Assignee: Claire McGinty

> AvroParquetWriter should default to data supplier model from Configuration
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-2265
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2265
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Claire McGinty
>            Assignee: Claire McGinty
>            Priority: Major
>             Fix For: 1.14.0
>
>
> I recently ran into a bug where the AvroDataSupplier I specified in my
> Configuration wasn't respected when creating an AvroParquetWriter:
>
> ```
> Configuration configuration = new Configuration();
> // Register the custom supplier's class under the data-supplier key.
> configuration.setClass(AvroWriteSupport.AVRO_DATA_SUPPLIER,
>     myCustomDataSupplier.getClass(), AvroDataSupplier.class);
> AvroParquetWriter<MyAvroRecord> writer =
>     AvroParquetWriter.<MyAvroRecord>builder(...)
>         .withSchema(...)
>         .withConf(configuration)
>         .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData
> model, rather than the value of `myCustomDataSupplier.get()`. This is due to
> AvroParquetWriter defaulting to the SpecificData model[0] if one isn't
> supplied in the AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO
> this creates confusion/redundancy, since I end up supplying the data model
> twice; also, I can't create any abstractions around this (e.g. a
> `createWriterForConfiguration(Configuration conf)` type of method) without
> having to use reflection to obtain a data model from the value of
> `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)`.
> I think it would be simplest if AvroParquetWriter.Builder just defaulted to
> `model = null` and let AvroWriteSupport initialize it based on the
> Configuration[1]. What do you think? That seems to be what AvroParquetReader
> is currently doing[2].
>
> [0] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163
> [1] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134
> [2] https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
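
For illustration, the fallback proposed in the issue can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the parquet-avro implementation: `Conf`, `WriterBuilder`, the two supplier classes and `DATA_SUPPLIER_KEY` are hypothetical stand-ins for Hadoop's Configuration, AvroParquetWriter.Builder, AvroDataSupplier implementations and AvroWriteSupport.AVRO_DATA_SUPPLIER, and a data model is reduced to a String name so the example runs without any dependencies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Every type here is a hypothetical stand-in, not a parquet-avro or Hadoop class.
final class DataModelFallbackSketch {

    // Stand-in for the key AvroWriteSupport.AVRO_DATA_SUPPLIER (value is made up).
    static final String DATA_SUPPLIER_KEY = "sketch.avro.data.supplier";

    /** Stand-in for Configuration: maps a key to a supplier class. */
    static final class Conf {
        private final Map<String, Class<? extends Supplier<String>>> classes = new HashMap<>();

        void setClass(String key, Class<? extends Supplier<String>> cls) {
            classes.put(key, cls);
        }

        Class<? extends Supplier<String>> getSupplierClass(
                String key, Class<? extends Supplier<String>> defaultClass) {
            return classes.getOrDefault(key, defaultClass);
        }
    }

    /** Stand-in for the supplier that yields the default (SpecificData) model. */
    static final class SpecificModelSupplier implements Supplier<String> {
        public String get() { return "SpecificData"; }
    }

    /** Stand-in for a user-defined data supplier. */
    static final class CustomModelSupplier implements Supplier<String> {
        public String get() { return "CustomData"; }
    }

    /** Stand-in for the writer builder, with the model defaulting to null. */
    static final class WriterBuilder {
        private Conf conf = new Conf();
        private String model; // null means "not set explicitly on the builder"

        WriterBuilder withConf(Conf conf) { this.conf = conf; return this; }

        WriterBuilder withDataModel(String model) { this.model = model; return this; }

        /** An explicit model wins; otherwise the configured supplier decides. */
        String resolveModel() {
            if (model != null) {
                return model;
            }
            Class<? extends Supplier<String>> cls =
                    conf.getSupplierClass(DATA_SUPPLIER_KEY, SpecificModelSupplier.class);
            try {
                return cls.getDeclaredConstructor().newInstance().get();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + cls.getName(), e);
            }
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.setClass(DATA_SUPPLIER_KEY, CustomModelSupplier.class);

        // Supplier taken from the configuration.
        System.out.println(new WriterBuilder().withConf(conf).resolveModel());
        // Nothing configured: falls back to the default supplier.
        System.out.println(new WriterBuilder().resolveModel());
        // Explicit builder setting takes precedence over the configuration.
        System.out.println(
                new WriterBuilder().withConf(conf).withDataModel("ExplicitData").resolveModel());
    }
}
```

With this ordering, a caller who only sets the supplier in the configuration gets that supplier's model, while an explicit `withDataModel` call still overrides it. Whether the real builder should resolve the model itself or leave it to the write support, as the issue suggests, is a design choice this sketch does not settle.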