richox opened a new issue, #4477:
URL: https://github.com/apache/arrow-rs/issues/4477

   **Describe the bug**
   <!--
   A clear and concise description of what the bug is.
   -->
   
   When writing large Parquet files using `AsyncArrowWriter`, we found that 
memory usage is unexpectedly high and sometimes causes the process to run out of 
memory.
   
   The bug is likely in the following code. It tries to trigger a flush once 
the buffer length reaches half of the buffer's capacity. However, when data is 
written into the buffer, the capacity also grows along with the length, so this 
condition does not work as expected.
   
   
https://github.com/apache/arrow-rs/blob/aac3aa99398c4f4fe59c60d1839d3a8ab60d00f3/parquet/src/arrow/async_writer/mod.rs#L145
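   The mechanism can be shown with a plain `Vec<u8>` standing in for the writer's internal buffer (a minimal sketch with hypothetical sizes, not arrow-rs code): a large write grows the capacity, a flush (`clear`) keeps the grown capacity, and from then on a `len() >= capacity() / 2` check lets far more data accumulate than the configured capacity intended.

```rust
fn main() {
    // Hypothetical configured buffer size, not a value taken from arrow-rs.
    let initial_capacity = 1024;
    let mut buffer: Vec<u8> = Vec::with_capacity(initial_capacity);

    // One large write (e.g. an encoded row group) grows the capacity
    // along with the length.
    buffer.extend_from_slice(&vec![0u8; 64 * initial_capacity]);
    assert!(buffer.capacity() >= 64 * initial_capacity);

    // A flush clears the contents but `Vec::clear` keeps the grown capacity...
    buffer.clear();
    assert!(buffer.capacity() >= 64 * initial_capacity);

    // ...so a capacity-based flush check `len() >= capacity() / 2` now lets
    // many times the intended amount of data accumulate before flushing.
    let intended_threshold = initial_capacity / 2;
    let actual_threshold = buffer.capacity() / 2;
    assert!(actual_threshold >= 32 * intended_threshold);
    println!(
        "intended flush threshold: {} bytes, actual: {} bytes",
        intended_threshold, actual_threshold
    );
}
```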
   
   **To Reproduce**
   <!--
   Steps to reproduce the behavior:
   -->
   
   Read a large Parquet file, then write it to another file with `AsyncArrowWriter`. 
Since reading is usually faster than writing, data is buffered but not 
correctly flushed, causing an OOM.
   
   **Expected behavior**
   <!--
   A clear and concise description of what you expected to happen.
   -->
   
   Flushing should be triggered based on the constant initial buffer capacity, 
not on the current (growing) capacity.
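   
   One way the fix could look (a hypothetical sketch, not the actual arrow-rs patch; `BufferedWriter`, `trigger_size`, and `should_flush` are illustrative names): remember the configured capacity at construction time and compare the buffer length against that constant, instead of against `Vec::capacity()`, which grows with writes.

```rust
// Hypothetical sketch of the proposed behavior, not arrow-rs code.
struct BufferedWriter {
    buffer: Vec<u8>,
    trigger_size: usize, // constant flush threshold, fixed at construction
}

impl BufferedWriter {
    fn new(capacity: usize) -> Self {
        Self {
            buffer: Vec::with_capacity(capacity),
            trigger_size: capacity / 2,
        }
    }

    /// Returns true when the buffered bytes should be flushed to the sink.
    /// The threshold does not move even if the Vec's capacity grows.
    fn should_flush(&self, force: bool) -> bool {
        force || self.buffer.len() >= self.trigger_size
    }
}

fn main() {
    let mut w = BufferedWriter::new(1024);
    w.buffer.extend_from_slice(&vec![0u8; 64 * 1024]); // capacity grows...
    assert!(w.should_flush(false)); // ...but the threshold stays constant
    w.buffer.clear();
    assert!(!w.should_flush(false));
    println!("flush threshold stays at {} bytes", w.trigger_size);
}
```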
   
   **Additional context**
   <!--
   Add any other context about the problem here.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to