alamb commented on code in PR #5457:
URL: https://github.com/apache/arrow-rs/pull/5457#discussion_r1516916704


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -80,6 +80,32 @@ mod levels;
 ///
 /// assert_eq!(to_write, read);
 /// ```
+///
+/// ## Memory Limiting
+///
+/// The nature of parquet forces buffering of an entire row group before it can be flushed

Review Comment:
   Would it be worth suggesting to users that if they want to minimize memory overages when writing such data, they can send in smaller `RecordBatches` (e.g. split up via `RecordBatch::slice`), which gives the parquet writer more chances to check / flush?
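
   A hedged sketch of that suggestion (not from the PR; the chunk size and the `Vec<u8>` sink are illustrative): slicing a large batch before writing gives `ArrowWriter` more opportunities to check its limits and flush between writes.

   ```rust
   use arrow_array::RecordBatch;
   use parquet::arrow::ArrowWriter;

   /// Write `batch` in fixed-size slices so the writer can flush in between.
   fn write_in_slices(
       writer: &mut ArrowWriter<Vec<u8>>,
       batch: &RecordBatch,
   ) -> parquet::errors::Result<()> {
       let chunk = 1024; // illustrative slice size
       let mut offset = 0;
       while offset < batch.num_rows() {
           let len = chunk.min(batch.num_rows() - offset);
           // `RecordBatch::slice` is zero-copy; each smaller write gives the
           // writer another chance to check its buffered size and flush
           writer.write(&batch.slice(offset, len))?;
           offset += len;
       }
       Ok(())
   }
   ```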



##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -69,6 +69,29 @@ use tokio::io::{AsyncWrite, AsyncWriteExt};
 /// It is implemented based on the sync writer [`ArrowWriter`] with an inner buffer.
 /// The buffered data will be flushed to the writer provided by caller when the
 /// buffer's threshold is exceeded.
+///
+/// ## Memory Limiting
+///
+/// The nature of parquet forces buffering of an entire row group before it can be flushed
+/// to the underlying writer. This buffering may exceed the configured buffer size
+/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
+/// although this will have implications for file size and query performance. See [ArrowWriter]
+/// for more information.

Review Comment:
   I agree it would help -- perhaps something like the following in `try_new`/`try_new_with_options`:
   
   ```rust
   /// Please see the documentation on [`Self`] for details on memory usage.
   ```

