alamb opened a new issue, #4823: URL: https://github.com/apache/arrow-rs/issues/4823
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

In DataFusion, @devinjdangelo is using the [`append_column`](https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column) API to write parquet files in parallel (https://github.com/apache/arrow-datafusion/pull/7562). However, copying the `RowGroupMetaData` into that API, so that bloom filters, page offset indexes, and other metadata are carried over, is awkward.

**Describe the solution you'd like**

I would like a way to call the `append_column` API given a [`RowGroupMetaData`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaData.html) object from the existing file. Ideally there would be an API that produced a [`ColumnCloseResult`](https://docs.rs/parquet/latest/parquet/column/writer/struct.ColumnCloseResult.html) from a `RowGroupMetaData`, or some convenience API that took a reader plus a `RowGroupMetaData` from another file and did the necessary copy.

Perhaps something like:

```rust
impl SerializedRowGroupWriter {
    ...

    /// Appends an entire row group from the specified reader, including all
    /// metadata, to the in-progress parquet file.
    pub fn append_row_group(&mut self, rg: Box<dyn RowGroupReader>) -> Result<...> {
        ...
    }
}
```
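For illustration, a caller-side sketch of how such an API might be used. `append_row_group` is the hypothetical method proposed above; `SerializedFileReader`, `SerializedFileWriter`, and `next_row_group` are existing `parquet` crate APIs:

```rust
// Sketch only: copy every row group from an existing file into a new one,
// preserving bloom filters, page indexes, and other column metadata.
let reader = SerializedFileReader::new(input)?;
let mut writer = SerializedFileWriter::new(output, schema, props)?;

for i in 0..reader.metadata().num_row_groups() {
    let mut rg_writer = writer.next_row_group()?;
    // Proposed API: copies column data and row group metadata wholesale,
    // instead of reconstructing a ColumnCloseResult per column by hand.
    rg_writer.append_row_group(reader.get_row_group(i)?)?;
    rg_writer.close()?;
}
writer.close()?;
```

This would let the parallel-write path in the DataFusion PR concatenate independently written files without losing per-column metadata.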
