This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new ac00928899 Improve documentation on writing parquet, including multiple threads (#7321)
ac00928899 is described below
commit ac0092889984864f588ee4e6372fb03fdb8e09f8
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Mar 26 12:44:32 2025 -0400
Improve documentation on writing parquet, including multiple threads (#7321)
* Improve parquet documentation on writing with multiple threads
tweak docs
* Apply suggestions from code review
Co-authored-by: Ed Seidl <[email protected]>
---------
Co-authored-by: Ed Seidl <[email protected]>
---
parquet/src/arrow/arrow_writer/mod.rs | 8 +++---
parquet/src/arrow/async_reader/mod.rs | 13 +++++-----
parquet/src/arrow/async_writer/mod.rs | 8 +++---
parquet/src/arrow/mod.rs | 9 +++----
parquet/src/lib.rs | 47 ++++++++++++++++++++++++-----------
5 files changed, 54 insertions(+), 31 deletions(-)
diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index f1081f1481..fb441f4d33 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -59,6 +59,7 @@ mod levels;
/// flushed on close, leading the final row group in the output file to potentially
/// contain fewer than `max_row_group_size` rows
///
+/// # Example: Writing `RecordBatch`es
/// ```
/// # use std::sync::Arc;
/// # use bytes::Bytes;
@@ -80,11 +81,11 @@ mod levels;
/// assert_eq!(to_write, read);
/// ```
///
-/// ## Memory Limiting
+/// # Memory Usage and Limiting
///
-/// The nature of parquet forces buffering of an entire row group before it can
+/// The nature of Parquet requires buffering of an entire row group before it can
/// be flushed to the underlying writer. Data is mostly buffered in its encoded
-/// form, reducing memory usage. However, some data such as dictionary keys or
+/// form, reducing memory usage. However, some data such as dictionary keys,
/// large strings or very nested data may still result in non-trivial memory
/// usage.
///
@@ -532,6 +533,7 @@ impl ArrowColumnChunk {
/// Note: This is a low-level interface for applications that require fine-grained control
/// of encoding, see [`ArrowWriter`] for a higher-level interface
///
+/// # Example: Encoding two Arrow Arrays in Parallel
/// ```
/// // The arrow schema
/// # use std::sync::Arc;
diff --git a/parquet/src/arrow/async_reader/mod.rs b/parquet/src/arrow/async_reader/mod.rs
index fac27328e3..3f04a14090 100644
--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -15,11 +15,9 @@
// specific language governing permissions and limitations
// under the License.
-//! [`ParquetRecordBatchStreamBuilder`]: `async` API for reading Parquet files as
-//! [`RecordBatch`]es
+//! `async` API for reading Parquet files as [`RecordBatch`]es
//!
-//! This can be used to decode a Parquet file in streaming fashion (without
-//! downloading the whole file at once) from a remote source, such as an object store.
+//! See the [crate-level documentation](crate) for more details.
//!
//! See example on [`ParquetRecordBatchStreamBuilder::new`]
@@ -269,8 +267,11 @@ pub struct AsyncReader<T>(T);
/// A builder for reading parquet files from an `async` source as [`ParquetRecordBatchStream`]
///
-/// This builder handles reading the parquet file metadata, allowing consumers
-/// to use this information to select what specific columns, row groups, etc...
+/// This can be used to decode a Parquet file in streaming fashion (without
+/// downloading the whole file at once) from a remote source, such as an object store.
+///
+/// This builder handles reading the parquet file metadata, allowing consumers
+/// to use this information to select what specific columns, row groups, etc.
/// they wish to be read by the resulting stream.
///
/// See examples on [`ParquetRecordBatchStreamBuilder::new`]
diff --git a/parquet/src/arrow/async_writer/mod.rs b/parquet/src/arrow/async_writer/mod.rs
index c04d5710a9..edd4b71ae2 100644
--- a/parquet/src/arrow/async_writer/mod.rs
+++ b/parquet/src/arrow/async_writer/mod.rs
@@ -15,10 +15,12 @@
// specific language governing permissions and limitations
// under the License.
-//! Contains async writer which writes arrow data into parquet data.
+//! `async` API for writing [`RecordBatch`]es to Parquet files
//!
-//! Provides `async` API for writing [`RecordBatch`]es as parquet files. The API is
-//! similar to the [`sync` API](crate::arrow::arrow_writer::ArrowWriter), so please
+//! See the [crate-level documentation](crate) for more details.
+//!
+//! The `async` API for writing [`RecordBatch`]es is
+//! similar to the [`sync` API](ArrowWriter), so please
//! read the documentation there before using this API.
//!
//! Here is an example for using [`AsyncArrowWriter`]:
diff --git a/parquet/src/arrow/mod.rs b/parquet/src/arrow/mod.rs
index e3f09a0dce..b89c6ddcf8 100644
--- a/parquet/src/arrow/mod.rs
+++ b/parquet/src/arrow/mod.rs
@@ -15,14 +15,13 @@
// specific language governing permissions and limitations
// under the License.
-//! High-level API for reading/writing Arrow
-//! [RecordBatch](arrow_array::RecordBatch)es and
+//! API for reading/writing
+//! Arrow [RecordBatch](arrow_array::RecordBatch)es and
//! [Array](arrow_array::Array)s to/from Parquet Files.
//!
-//! [Apache Arrow](http://arrow.apache.org/) is a cross-language development platform for
-//! in-memory data.
+//! See the [crate-level documentation](crate) for more details.
//!
-//!# Example of writing Arrow record batch to Parquet file
+//! # Example of writing Arrow record batch to Parquet file
//!
//!```rust
//! # use arrow_array::{Int32Array, ArrayRef};
diff --git a/parquet/src/lib.rs b/parquet/src/lib.rs
index 86c88ff954..f814ddeb07 100644
--- a/parquet/src/lib.rs
+++ b/parquet/src/lib.rs
@@ -52,28 +52,47 @@
//! The [`schema`] module provides APIs to work with Parquet schemas. The
//! [`file::metadata`] module provides APIs to work with Parquet metadata.
//!
-//! ## Read/Write Arrow
+//! ## Reading and Writing Arrow (`arrow` feature)
//!
-//! The [`arrow`] module allows reading and writing Parquet data to/from Arrow `RecordBatch`.
-//! This makes for a simple and performant interface to parquet data, whilst allowing workloads
+//! The [`arrow`] module supports reading and writing Parquet data to/from
+//! Arrow `RecordBatch`es. Using Arrow is simple and performant, and allows workloads
//! to leverage the wide range of data transforms provided by the [arrow] crate, and by the
-//! ecosystem of libraries and services using [Arrow] as an interop format.
+//! ecosystem of [Arrow] compatible systems.
//!
-//! ## Read/Write Arrow Async
+//! Most users will use [`ArrowWriter`] for writing and [`ParquetRecordBatchReaderBuilder`] for
+//! reading.
//!
-//! When the `async` feature is enabled, [`arrow::async_reader`] and [`arrow::async_writer`]
-//! provide the ability to read and write [`arrow`] data asynchronously. Additionally, with the
-//! `object_store` feature is enabled, [`ParquetObjectReader`](arrow::async_reader::ParquetObjectReader)
+//! Lower level APIs include [`ArrowColumnWriter`] for writing using multiple
+//! threads, and [`RowFilter`] to apply filters during decode.
+//!
+//! [`ArrowWriter`]: arrow::arrow_writer::ArrowWriter
+//! [`ParquetRecordBatchReaderBuilder`]: arrow::arrow_reader::ParquetRecordBatchReaderBuilder
+//! [`ArrowColumnWriter`]: arrow::arrow_writer::ArrowColumnWriter
+//! [`RowFilter`]: arrow::arrow_reader::RowFilter
+//!
+//! ## `async` Reading and Writing Arrow (`async` feature)
+//!
+//! The [`async_reader`] and [`async_writer`] modules provide async APIs to
+//! read and write `RecordBatch`es asynchronously.
+//!
+//! Most users will use [`AsyncArrowWriter`] for writing and [`ParquetRecordBatchStreamBuilder`]
+//! for reading. When the `object_store` feature is enabled, [`ParquetObjectReader`]
//! provides efficient integration with object storage services such as S3 via the [object_store]
//! crate, automatically optimizing IO based on any predicates or projections provided.
//!
-//! ## Read/Write Parquet
+//! [`async_reader`]: arrow::async_reader
+//! [`async_writer`]: arrow::async_writer
+//! [`AsyncArrowWriter`]: arrow::async_writer::AsyncArrowWriter
+//! [`ParquetRecordBatchStreamBuilder`]: arrow::async_reader::ParquetRecordBatchStreamBuilder
+//! [`ParquetObjectReader`]: arrow::async_reader::ParquetObjectReader
+//!
+//! ## Read/Write Parquet Directly
//!
-//! Workloads needing finer-grained control, or avoid a dependence on arrow,
-//! can use the lower-level APIs in [`mod@file`]. These APIs expose the underlying parquet
-//! data model, and therefore require knowledge of the underlying parquet format,
-//! including the details of [Dremel] record shredding and [Logical Types]. Most workloads
-//! should prefer the arrow interfaces.
+//! Workloads needing finer-grained control, or to avoid a dependence on arrow,
+//! can use the APIs in [`mod@file`] directly. These APIs are harder to use
+//! as they directly use the underlying Parquet data model, and require knowledge
+//! of the Parquet format, including the details of [Dremel] record shredding
+//! and [Logical Types].
//!
//! [arrow]: https://docs.rs/arrow/latest/arrow/index.html
//! [Arrow]: https://arrow.apache.org/