[
https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17710357#comment-17710357
]
ASF GitHub Bot commented on PARQUET-2265:
-----------------------------------------
clairemcginty opened a new pull request, #1049:
URL: https://github.com/apache/parquet-mr/pull/1049
This PR removes the default `model` setting in `AvroParquetWriter.Builder`.
The reasons for this change are:
1) `AvroWriteSupport` already sets a default `model` value (also to
`SpecificDataSupplier`) in
[AvroWriteSupport::init](https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L126-L135),
so setting a default in `AvroParquetWriter` is redundant.
2) It created some confusion for me, since I expected that the class I set
in
[parquet.avro.write.data.supplier](https://github.com/apache/parquet-mr/blob/d38044f5395494e1543581a4b763f624305d3022/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L54)
would automatically be used as the `model`. This change ensures that if no
data model is supplied in the `AvroParquetWriter.Builder`, then
`AvroWriteSupport` will select a data model based on the
`parquet.avro.write.data.supplier` config value, if present, and
`SpecificDataSupplier` if not present.
3. `AvroParquetReader`
[sets](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133)
`model = null`, so this PR makes the behavior of AvroParquetReader|Writer
consistent.
### Jira
- [x] My PR addresses the following [Parquet
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2265
- In case you are adding a dependency, check if the license complies with
the [ASF 3rd Party License
Policy](https://www.apache.org/legal/resolved.html#category-x).
### Tests
- [x] My PR adds the following unit tests __OR__ **does not need testing for
this extremely good reason**: Since AvroWriteSupport still sets `model` to a
default value of `SpecificDataSupplier` in
[init](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L403-L406),
the existing SpecificRecord writer tests continuing to pass should effectively
test this change. I'm happy to add a test with a custom DataSupplier!
### Commits
- [x] My commits all reference Jira issues in their subject lines. In
addition, my commits follow the guidelines from "[How to write a good git
commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes
how to use it.
- All the public functions and the classes in the PR contain Javadoc that
explain what it does
> AvroParquetWriter should default to data supplier model from Configuration
> --------------------------------------------------------------------------
>
> Key: PARQUET-2265
> URL: https://issues.apache.org/jira/browse/PARQUET-2265
> Project: Parquet
> Issue Type: Improvement
> Reporter: Claire McGinty
> Priority: Major
>
> I recently ran into a bug where the AvroDataSupplier I specified in my
> Configuration wasn't respected when creating an AvroParquetWriter:
>
> ```
> Configuration configuration = new Configuration();
> configuration.put(AvroWriteSupport.AVRO_DATA_SUPPLIER, myCustomDataSupplier)
> AvroParquetWriter<MyAvroRecord> writer =
> AvroParquetWriter.<MyAvroRecord>builder(...)
> .withSchema(...)
> .withConf(configuration)
> .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData
> model, rather than the value of `myCustomDataSupplier.get()`. This is due to
> AvroParquetWriter defaulting to SpecificData model[0] if it's not supplied in
> the AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO
> this creates confusion/redundancy, since I end up supplying the data model
> twice; also, I can't create any abstractions around this (i.e. a
> `createWriterForConfiguration(Configuration conf)` type of method) without
> having to use reflection to invoke a dataModel for the value of
> `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)`.
> I think it would be simplest if AvroWriteSupport just defaulted to `model =
> null` and let AvroWriteSupport initialize it based on the Configuration[1].
> What do you think? That seems to be what AvroParquetReader is currently
> doing[2].
>
> [0][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163]
> [1][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134]
>
> [2]https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133
--
This message was sent by Atlassian Jira
(v8.20.10#820010)