alamb commented on code in PR #4280:
URL: https://github.com/apache/arrow-rs/pull/4280#discussion_r1207836196
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -152,43 +147,75 @@ impl<W: Write> ArrowWriter<W> {
self.writer.flushed_row_groups()
}
- /// Enqueues the provided `RecordBatch` to be written
+ /// Returns the length in bytes of the current in progress row group
+ pub fn in_progress_size(&self) -> usize {
Review Comment:
To be clear here, the use case is: "I want to ensure that when writing many
parquet files concurrently we don't exceed some memory limit (so that the
process doing so isn't killed by k8s / the operating system)."
This doesn't need to be super accurate, just a tight enough upper bound to
achieve the above goal.
If it would be ok to add a 1MB overhead for each column (e.g. the
PAGE_SIZE, or wherever that buffer is defined), I can try to propose a patch
to do so.
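For context, a minimal sketch of the kind of caller I have in mind, using
`in_progress_size` together with `ArrowWriter::flush` (which writes out the
buffered row group). The `MEMORY_BUDGET` constant and `write_with_budget`
helper are made-up names for illustration, not part of this PR:

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_schema::Schema;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Hypothetical per-writer memory budget; in practice this would be
/// derived from the process limit and the number of concurrent writers.
const MEMORY_BUDGET: usize = 64 * 1024 * 1024; // 64 MiB

fn write_with_budget(batches: &[RecordBatch], schema: Arc<Schema>) -> Result<Vec<u8>> {
    let mut writer = ArrowWriter::try_new(Vec::new(), schema, None)?;
    for batch in batches {
        writer.write(batch)?;
        // When the estimated size of the in-progress row group exceeds the
        // budget, flush it into a finished row group to release the buffers.
        if writer.in_progress_size() > MEMORY_BUDGET {
            writer.flush()?;
        }
    }
    writer.into_inner()
}
```

Even if the estimate overstates the true usage by ~1MB per column, it is
still a usable upper bound for this purpose.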