CTTY opened a new issue, #1650:
URL: https://github.com/apache/iceberg-rust/issues/1650

   ### Is your feature request related to a problem or challenge?
   
   Hi icy crustaceans,
   
   While working on adding the 
[`[ClusteredWriter](https://github.com/apache/iceberg-rust/pull/1646)`](https://github.com/apache/iceberg-rust/pull/1646),
 I noticed that the existing `FileWriterBuilder` trait is quite limiting for 
partition-aware writers. The main issue is that the location generator is 
coupled with the low-level file writer (such as `ParquetWriter`), which 
effectively requires `ParquetWriter` itself to be partition-aware.
   
   In addition, [the current `FileWriterBuilder` 
trait](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/writer/file_writer/mod.rs#L41)
 cannot pass partition information from outer writers to inner writers, making 
it very difficult to implement partitioning writers:
   
   ```rust
   fn build(self) -> impl Future<Output = Result<Self::R>> + Send;
   ```
   
   A hacky workaround would be to have partitioning writers use the concrete 
type `ParquetWriter` as the inner writer, but I think this will lead to 
extensibility issues fairly quickly.
   
   ### Describe the solution you'd like
   
   I see a few aspects of this issue:
   
   * **Decouple `LocationGenerator` and `ParquetWriter` by changing the 
`FileWriterBuilder` trait**
   
   By changing the `FileWriterBuilder` interface to the following, we can make 
`ParquetWriter` focus solely on writing a file without needing to know which 
partition it’s writing to:
   
   ```rust
   fn build(self, output_file: OutputFile) -> impl Future<Output = 
Result<Self::R>> + Send;
   ```
   
   We will also need to adjust `IcebergWriterBuilder` accordingly:
   
   ```rust
   async fn build(self, output_file: OutputFile) -> Result<Self::R>;
   ```
   
   * **Make the functional writer the outermost layer**
   
   After modifying the builder traits, we still need to decide which layer of 
writers should hold the `LocationGenerator`. I propose moving functional 
writers such as `RollingFileWriter` and the upcoming `ClusteredWriter` to the 
outermost layer, and having them own the `LocationGenerator`.
   
   Currently, the layering looks like this:
   
   ```
   // DataFileWriter/IcebergWriter is the outermost layer
   DataFileWriter(
     RollingFileWriter( // currently implements FileWriter
       ParquetWriter()
     )
   )
   ```
   
   My proposal:
   
   ```
   SomeFunctionalWriter( // no longer implements FileWriter
     DataFileWriter(
       ParquetWriter()
     )
   )
   ```
   
   This new design provides several benefits:
   
   1. **Clear layering**: Functional writers hold the `LocationGenerator` and 
generate new `OutputFile`s for the inner writers, while the inner writers focus 
only on writing a single file without worrying about path or location.
   2. **Flexibility**: Functional writers no longer need to implement the 
`FileWriter` trait, making them easier to use and configure. They don’t 
necessarily need to implement any trait, though we could define something like:
   
      ```rust
      FunctionalWriterBuilder {
        async fn build(
          self,
          location_gen: LocationGenerator,
          filename_gen: FileNameGenerator
        ) -> Result<Self::R>;
      }
      ```
   
   I’m still exploring and refining this idea, and would appreciate any 
feedback!
   
   
   ### Willingness to contribute
   
   I would be willing to contribute to this feature with guidance from the 
Iceberg Rust community


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to