[jira] [Commented] (PARQUET-2351) ParquetWriter/ParquetReader should parse options directly from supplied Configuration

ASF GitHub Bot (Jira) Wed, 27 Sep 2023 22:19:06 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769902#comment-17769902
 ]


ASF GitHub Bot commented on PARQUET-2351:
-----------------------------------------

wgtmac commented on code in PR #1157:
URL: https://github.com/apache/parquet-mr/pull/1157#discussion_r1339538597


##########
parquet-hadoop/src/test/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilterTest.java:
##########
@@ -195,10 +200,28 @@ private static void prepareFile(WriterVersion version, 
Path file) throws IOExcep
     writeData(f, writer);
   }
 
+  private static void prepareFileWithConf(WriterVersion version, Path file) 
throws IOException {
+    Configuration configuration = new Configuration();

Review Comment:
   It seems that these specific configurations defined in the 
ParquetOutputFormat are solely used for ParquetOutputFormat to create a 
RecordWriter, which actually puts all of them into a ParquetProperties.
   
   
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L457-L484
   
   IIUC, relying on settings from Hadoop configuration is discouraged, we 
should use ParquetProperties to set all those things.



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java:
##########
@@ -358,9 +358,29 @@ public abstract static class Builder<T, SELF extends 
Builder<T, SELF>> {
     private long rowGroupSize = DEFAULT_BLOCK_SIZE;
     private int maxPaddingSize = MAX_PADDING_SIZE_DEFAULT;
     private boolean enableValidation = DEFAULT_IS_VALIDATING_ENABLED;
-    private ParquetProperties.Builder encodingPropsBuilder =
+    private final ParquetProperties.Builder encodingPropsBuilder =
         ParquetProperties.builder();
 
+    private boolean isPageSizeSet = false;

Review Comment:
   TBH, these would make the code less maintainable. Not sure if Optional would 
make them more organized.





> ParquetWriter/ParquetReader should parse options directly from supplied 
> Configuration
> -------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2351
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2351
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Claire McGinty
>            Priority: Minor
>
> As a Parquet user, my expectation is that ParquetWriter/ParquetReader will 
> automatically parse any options passed in the supplied Configuration. For 
> example:
>  
> ```
> Configuration conf = new Configuration();
> conf.setBoolean(ParquetOutputFormat.BLOOM_FILTER_ENABLED, true);
> ParquetWriter<Car> writer = AvroParquetWriter.<Car>builder(path)
> .withSchema(Car.SCHEMA$)
> .withConf(conf)
> .build();
> ```
>  
> However, the above code results in a ParquetWriter where `bloomFilterEnabled` 
> is set to false-the expected way to configure this is to use 
> `ParquetWriter#withBloomFilterEnabled` directly.
>  
> IMO, when `ParquetWriter#withConf` is invoked, it should also delegate to 
> `encodingPropsBuilder` and set any Configured properties.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2351) ParquetWriter/ParquetReader should parse options directly from supplied Configuration

Reply via email to