jerolba opened a new issue, #3300:
URL: https://github.com/apache/parquet-java/issues/3300

   ### Describe the enhancement requested
   
   **Background**
   
   I am the developer of [Carpet](https://github.com/jerolba/parquet-carpet), a 
library that builds upon the Parquet Java project. 
   
   Carpet reuses the primitives provided by the Parquet Java library: to 
instantiate `ParquetWriter` and `ParquetReader` classes, I use and extend 
existing builders. To support all Parquet Java features transparently to users, 
Carpet exposes the same configuration options as the underlying library and 
transitively reuses all logic implemented by `ParquetWriter`, 
`InternalParquetRecordWriter`, `ParquetReader`, and 
`InternalParquetRecordReader`.
   
   My goal is to implement data partitioning and parallel read/write 
capabilities in Carpet. I want to continue reusing existing builders and logic, 
but with additional abstractions for partitioning and parallelism. To support 
this, I need to build multiple instances of ParquetWriter/ParquetReader based 
on a single builder configuration, only changing the target file to write/read 
before calling the build method.
   
   With the 
[existing](https://github.com/apache/parquet-java/blob/8be0dadaea9ded29d61fb10afb6dfe7d516ee316/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L479)
 
[builders](https://github.com/apache/parquet-java/blob/8be0dadaea9ded29d61fb10afb6dfe7d516ee316/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L217),
 it's not possible to implement this feature because an Input/OutputFile is a 
required argument in the static builder-creation methods. This forces the file 
to be defined at the very beginning of the configuration process, preventing 
the builder instance from being reused for different files.
   
   **Suggested change**
   
   I propose adding no-argument builder factory methods for the `ParquetWriter` 
and `ParquetReader` classes.
   
   The builder pattern usually allows options to be configured in any order and 
at any stage, while still guaranteeing that all required parameters are set and 
consistent when the instance is created. Why force the file to be set at the 
beginning instead of allowing it to be set later?
   
   [This example 
code](https://github.com/apache/parquet-java/blob/4b6fbc1fb636f5553416b6bfd9ce7767ed058bbb/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java#L170):
   
   ```java
   ParquetWriter<Group> writer = ExampleParquetWriter.builder(outputFile)
               .withAllocator(allocator)
               .withCompressionCodec(UNCOMPRESSED)
               .withRowGroupSize(1024)
               .withPageSize(1024)
               .withDictionaryPageSize(512)
               .enableDictionaryEncoding()
               .withValidation(false)
               .withWriterVersion(version)
               .withConf(conf)
               .build();
   ```
   
   The same configuration could also be written as:
   
   ```java
   ExampleParquetWriter.Builder<Group> builder = ExampleParquetWriter.builder()
               .withAllocator(allocator)
               .withCompressionCodec(UNCOMPRESSED)
               .withRowGroupSize(1024)
               .withPageSize(1024)
               .withDictionaryPageSize(512)
               .enableDictionaryEncoding()
               .withValidation(false)
               .withWriterVersion(version)
               .withConf(conf);
   ParquetWriter<Group> writer1 = builder.withFile(outputFile1).build();
   // ...
   ParquetWriter<Group> writer2 = builder.withFile(outputFile2).build();
   ```
   
   My proposal is to:
   
   * Introduce no-argument builder() static methods for the ParquetWriter and 
ParquetReader builders
   * Allow setting the file later in the builder configuration process (only 
for InputFile and OutputFile types, leaving aside the deprecated Hadoop Path)
   * Ensure validation is performed within the `build` and `withFile` methods 
to throw an exception if the file has not been set, preserving the builder's 
safety guarantees.
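   
   The three points above can be illustrated with a minimal, self-contained 
sketch. It does not use the Parquet Java API: `SketchWriter`, 
`SketchWriterBuilder`, and the `String` standing in for `OutputFile` are 
hypothetical names, and the exception types are an assumption, since the 
proposal does not specify them.
   
   ```java
   import java.util.Objects;

   public class DeferredFileBuilderSketch {

       /** Hypothetical stand-in for a writer; it only records its configuration. */
       static final class SketchWriter {
           final String file;
           final int pageSize;

           SketchWriter(String file, int pageSize) {
               this.file = file;
               this.pageSize = pageSize;
           }
       }

       /** Hypothetical builder whose target file may be set at any stage. */
       static final class SketchWriterBuilder {
           private String file; // stays null until withFile is called
           private int pageSize = 1024;

           // No-argument factory: the file is supplied later.
           static SketchWriterBuilder builder() {
               return new SketchWriterBuilder();
           }

           // File-first factory, kept so existing call sites keep working.
           static SketchWriterBuilder builder(String file) {
               return builder().withFile(file);
           }

           SketchWriterBuilder withPageSize(int pageSize) {
               this.pageSize = pageSize;
               return this;
           }

           SketchWriterBuilder withFile(String file) {
               // Assumed behavior: reject a null file as soon as it is passed in.
               this.file = Objects.requireNonNull(file, "file cannot be null");
               return this;
           }

           SketchWriter build() {
               // Assumed behavior: fail if no file was ever set.
               if (file == null) {
                   throw new IllegalStateException("file must be set before build()");
               }
               return new SketchWriter(file, pageSize);
           }
       }

       public static void main(String[] args) {
           SketchWriterBuilder shared = SketchWriterBuilder.builder().withPageSize(512);
           SketchWriter first = shared.withFile("part-0.parquet").build();
           SketchWriter second = shared.withFile("part-1.parquet").build();
           System.out.println(first.file + " " + second.file + " " + second.pageSize);
           try {
               SketchWriterBuilder.builder().build();
           } catch (IllegalStateException e) {
               System.out.println("rejected: " + e.getMessage());
           }
       }
   }
   ```
   
   In this sketch the file-first factory simply delegates to the no-argument 
one, so both entry points share the same validation path.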
   
   These changes would be fully backward-compatible, as the existing 
`builder(file)` methods would remain untouched.
   
   This would allow for more flexible usage patterns and better support for 
advanced use cases like partitioning and parallel processing in Carpet or even 
as part of Parquet Java.
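   
   As a rough illustration of the partitioning use case, the sketch below 
creates one writer per partition key on demand from a single factory function. 
Everything here is hypothetical: `SketchWriter` is an in-memory stand-in, 
`PartitionedWriter` is not an existing Carpet or Parquet Java class, and the 
`data/<key>/part-0.parquet` layout is just an example. With the proposed API, 
the factory could presumably wrap a call such as 
`builder.withFile(file).build()`.
   
   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Function;

   public class PartitionedWriteSketch {

       /** Hypothetical in-memory writer standing in for a real file writer. */
       static final class SketchWriter {
           final String file;
           final List<String> rows = new ArrayList<>();

           SketchWriter(String file) {
               this.file = file;
           }

           void write(String row) {
               rows.add(row);
           }
       }

       /** Routes each row to a writer chosen by partition key, creating writers lazily. */
       static final class PartitionedWriter {
           private final Function<String, SketchWriter> writerForFile;
           private final Map<String, SketchWriter> writers = new LinkedHashMap<>();

           PartitionedWriter(Function<String, SketchWriter> writerForFile) {
               this.writerForFile = writerForFile;
           }

           void write(String partitionKey, String row) {
               // The first row for a key creates its writer; later rows reuse it.
               writers.computeIfAbsent(partitionKey,
                       key -> writerForFile.apply("data/" + key + "/part-0.parquet"))
                   .write(row);
           }

           Map<String, SketchWriter> writers() {
               return writers;
           }
       }

       public static void main(String[] args) {
           // The factory is the only place that knows how a writer is configured.
           PartitionedWriter out = new PartitionedWriter(SketchWriter::new);
           out.write("2024", "row-a");
           out.write("2025", "row-b");
           out.write("2024", "row-c");
           out.writers().forEach((key, w) ->
               System.out.println(key + " -> " + w.file + " " + w.rows));
       }
   }
   ```
   
   If the shared builder is mutable, as in the `withFile` example earlier in 
this issue, a parallel implementation would likely need to serialize calls to 
the factory or copy the builder per writer.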
   
   ### Component(s)
   
   Core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

