alamb commented on code in PR #4280:
URL: https://github.com/apache/arrow-rs/pull/4280#discussion_r1207836196
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -152,43 +147,75 @@ impl<W: Write> ArrowWriter<W> {
self.writer.flushed_row_groups()
}
- /// Enqueues the provided `RecordBatch` to be written
+ /// Returns the length in bytes of the current in progress row group
+ pub fn in_progress_size(&self) -> usize {
Review Comment:
To be clear here, the use case is: "I want to ensure that when writing many
parquet files concurrently we don't exceed some memory limit (so that the
process doing so isn't killed by k8s / the operating system)."
This doesn't need to be super accurate, just a tight enough upper bound to
achieve the above goal.
If it would be ok to add a 1MB overhead for each column (e.g. the
PAGE_SIZE, or wherever that buffer is defined), I can try to propose a patch
to do so.
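For context, a minimal sketch of the kind of caller I have in mind, using
`in_progress_size` together with `ArrowWriter::flush` (which writes out the
buffered row group). The `MEMORY_BUDGET` constant and `write_with_budget`
helper are made-up names for illustration, not part of this PR:

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_schema::Schema;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Hypothetical per-writer memory budget; in practice this would be
/// derived from the process limit and the number of concurrent writers.
const MEMORY_BUDGET: usize = 64 * 1024 * 1024; // 64 MiB

fn write_with_budget(batches: &[RecordBatch], schema: Arc<Schema>) -> Result<Vec<u8>> {
    let mut writer = ArrowWriter::try_new(Vec::new(), schema, None)?;
    for batch in batches {
        writer.write(batch)?;
        // When the estimated size of the in-progress row group exceeds the
        // budget, flush it into a finished row group to release the buffers.
        if writer.in_progress_size() > MEMORY_BUDGET {
            writer.flush()?;
        }
    }
    writer.into_inner()
}
```

Even if the estimate overstates the true usage by ~1MB per column, it is
still a usable upper bound for this purpose.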