alamb opened a new issue, #5484:
URL: https://github.com/apache/arrow-rs/issues/5484

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   @DDtKey suggested in https://github.com/apache/arrow-rs/pull/5457 
https://github.com/apache/arrow-rs/pull/5457#pullrequestreview-1913224197
   
   **Describe the solution you'd like**
   
   > I still think would be nice to have an additional config(or method) to 
"enforce flush on buffer size". To be able to encapsulate this logic for user's 
code 🤔
   
   The idea is to add an additional option to force the writer to flush when 
its buffered data hits a certain limit. 
   
   **Describe alternatives you've considered**
   
   The challenge is how to enforce the buffer limit without slowing down 
encoding. One idea would be to check memory usage after encoding each 
RecordBatch. This would be imprecise (the writer could go over the limit), as 
noted by @tustvold , but the overage would be bounded by the size of one 
RecordBatch (which the user can control).
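   To illustrate the bounded overage, here is a minimal stdlib-only sketch of the "check after each batch" strategy. `MockWriter`, `write_batch`, and the byte counts are all hypothetical stand-ins for a real Parquet writer, not the actual API:

   ```rust
   /// Hypothetical stand-in for a buffering Parquet writer.
   struct MockWriter {
       buffered: usize, // bytes currently buffered but not yet flushed
       target: usize,   // flush threshold, e.g. 10 MiB
       flushes: usize,  // how many flushes have occurred
   }

   impl MockWriter {
       fn new(target: usize) -> Self {
           Self { buffered: 0, target, flushes: 0 }
       }

       /// Encode one batch, then check the buffer size. Because the check
       /// happens only after the whole batch is encoded, the buffer can
       /// overshoot the target by at most one batch.
       fn write_batch(&mut self, batch_bytes: usize) {
           self.buffered += batch_bytes;
           if self.buffered >= self.target {
               self.flush();
           }
       }

       fn flush(&mut self) {
           self.buffered = 0;
           self.flushes += 1;
       }
   }

   fn main() {
       let mut w = MockWriter::new(10 * 1024 * 1024); // 10 MiB target
       for _ in 0..8 {
           w.write_batch(3 * 1024 * 1024); // 3 MiB per batch
       }
       // 24 MiB written in total; the buffer peaks at 12 MiB (target + less
       // than one batch) before each flush.
       println!("flushes = {}", w.flushes); // prints "flushes = 2"
   }
   ```

   The key property is that peak memory is bounded by `target + max_batch_size`, so a user who wants a tighter bound can simply write smaller batches.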
   
   This might look like adding something like the following to the 
[ArrowWriter](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html):
   
   ```rust
   let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None)
       .unwrap()
       // flush when buffered parquet data exceeds 10MB
       .with_target_buffer_size(10 * 1024 * 1024);
   ```
   
   Since not all parquet writers buffer their data this way, I don't think it 
makes sense to put the buffer size on the `WriterProperties` struct. 
   
   **Additional context**
   @tustvold  documented the current behavior better in 
https://github.com/apache/arrow-rs/pull/5457 

