alamb commented on code in PR #5485:
URL: https://github.com/apache/arrow-rs/pull/5485#discussion_r1522137448
##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -62,19 +61,15 @@ use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use tokio::io::{AsyncWrite, AsyncWriteExt};
-/// Async arrow writer.
+/// Encodes [`RecordBatch`] to parquet, outputting to an [`AsyncWrite`]
///
-/// It is implemented based on the sync writer [`ArrowWriter`] with an inner buffer.
-/// The buffered data will be flushed to the writer provided by caller when the
-/// buffer's threshold is exceeded.
+/// ## Memory Usage
///
-/// ## Memory Limiting
-///
-/// The nature of parquet forces buffering of an entire row group before it can be flushed
-/// to the underlying writer. This buffering may exceed the configured buffer size
-/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
-/// although this will have implications for file size and query performance. See [ArrowWriter]
-/// for more information.
+/// The columnar nature of parquet forces buffering data for an entire row group, as such
+/// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
+/// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
+/// flushing the row group, although this will have implications for file size and query
+/// performance. See [ArrowWriter] for more information.
Review Comment:
I think it would help to focus this comment on the rationale and
implications for users, rather than implementation.
Something like:
```suggestion
/// Similar to standard Rust I/O APIs such as `std::fs::File`, this writer eagerly writes
/// data to the underlying `AsyncWrite` as soon as possible. This permits fine-grained control
/// over buffering and I/O scheduling.
///
/// Note that the columnar nature of parquet forces buffering an entire row group
/// before flushing it to the provided [`AsyncWrite`]. Depending on the data and the configured
/// row group size, the buffer required may be substantial. Memory usage can be limited by
/// calling [`Self::flush`] to flush the in-progress row group, although this will likely
/// increase overall file size and reduce query performance. See [ArrowWriter] for more information.
///
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]