alamb commented on code in PR #5485:
URL: https://github.com/apache/arrow-rs/pull/5485#discussion_r1522137448
##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -62,19 +61,15 @@ use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use tokio::io::{AsyncWrite, AsyncWriteExt};
-/// Async arrow writer.
+/// Encodes [`RecordBatch`] to parquet, outputting to an [`AsyncWrite`]
///
-/// It is implemented based on the sync writer [`ArrowWriter`] with an inner buffer.
-/// The buffered data will be flushed to the writer provided by caller when the
-/// buffer's threshold is exceeded.
+/// ## Memory Usage
///
-/// ## Memory Limiting
-///
-/// The nature of parquet forces buffering of an entire row group before it can be flushed
-/// to the underlying writer. This buffering may exceed the configured buffer size
-/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
-/// although this will have implications for file size and query performance. See [ArrowWriter]
-/// for more information.
+/// The columnar nature of parquet forces buffering data for an entire row group, as such
+/// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
+/// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
+/// flushing the row group, although this will have implications for file size and query
+/// performance. See [ArrowWriter] for more information.
Review Comment:
I think it would help to focus this comment on the rationale and
implications for users, rather than implementation.
Something like:
```suggestion
/// Similar to standard Rust I/O APIs such as `std::fs::File`, this writer eagerly writes
/// data to the underlying `AsyncWrite` as soon as possible. This permits fine-grained control
/// over buffering and I/O scheduling.
///
/// Note that the columnar nature of parquet forces buffering an entire row group
/// before flushing it to the provided [`AsyncWrite`]. Depending on the data and the configured
/// row group size, the buffer required may be substantial. Memory usage can be limited by
/// calling [`Self::flush`] to flush the in-progress row group, although this will likely
/// increase overall file size and reduce query performance. See [ArrowWriter] for more information.
///
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]