[ 
https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720053#comment-17720053
 ] 

ASF GitHub Bot commented on PARQUET-2265:
-----------------------------------------

clairemcginty opened a new pull request, #1091:
URL: https://github.com/apache/parquet-mr/pull/1091

   Per discussion here 
https://lists.apache.org/thread/0qmwj33vohxymsfc1tw1f3141zrn62yp
   
   ### Jira
   
   - [z] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
     - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> AvroParquetWriter should default to data supplier model from Configuration
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-2265
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2265
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Claire McGinty
>            Assignee: Claire McGinty
>            Priority: Major
>             Fix For: 1.14.0
>
>
> I recently ran into a bug where the AvroDataSupplier I specified in my 
> Configuration wasn't respected when creating an AvroParquetWriter:
>  
> ```
> Configuration configuration = new Configuration();
> configuration.put(AvroWriteSupport.AVRO_DATA_SUPPLIER, myCustomDataSupplier)
> AvroParquetWriter<MyAvroRecord> writer =
>   AvroParquetWriter.<MyAvroRecord>builder(...)
>     .withSchema(...)
>     .withConf(configuration)
>     .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData 
> model, rather than the value of `myCustomDataSupplier.get()`. This is due to 
> AvroParquetWriter defaulting to SpecificData model[0] if it's not supplied in 
> the AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO 
> this creates confusion/redundancy, since I end up supplying the data model 
> twice; also, I can't create any abstractions around this (i.e. a 
> `createWriterForConfiguration(Configuration conf)` type of method) without 
> having to use reflection to invoke a dataModel for the value of 
> `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)`.
> I think it would be simplest if AvroWriteSupport just defaulted to `model = 
> null` and let AvroWriteSupport initialize it based on the Configuration[1]. 
> What do you think? That seems to be what AvroParquetReader is currently 
> doing[2].
>  
> [0][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163]
> [1][https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134]
>  
> [2]https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to