This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new ac00928899 Improve documentation on writing parquet, including multiple threads (#7321)
ac00928899 is described below
commit ac0092889984864f588ee4e6372fb03fdb8e09f8
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Mar 26 12:44:32 2025 -0400
Improve documentation on writing parquet, including multiple threads (#7321)
* Improve parquet documentation on writing with multiple threads
tweak docs
* Apply suggestions from code review
Co-authored-by: Ed Seidl <[email protected]>
---------
Co-authored-by: Ed Seidl <[email protected]>
---
parquet/src/arrow/arrow_writer/mod.rs | 8 +++---
parquet/src/arrow/async_reader/mod.rs | 13 +++++-----
parquet/src/arrow/async_writer/mod.rs | 8 +++---
parquet/src/arrow/mod.rs | 9 +++----
parquet/src/lib.rs | 47 ++++++++++++++++++++++++-----------
5 files changed, 54 insertions(+), 31 deletions(-)
diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index f1081f1481..fb441f4d33 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -59,6 +59,7 @@ mod levels;
/// flushed on close, leading the final row group in the output file to potentially
/// contain fewer than `max_row_group_size` rows
///
+/// # Example: Writing `RecordBatch`es
/// ```
/// # use std::sync::Arc;
/// # use bytes::Bytes;
@@ -80,11 +81,11 @@ mod levels;
/// assert_eq!(to_write, read);
/// ```
///
-/// ## Memory Limiting
+/// # Memory Usage and Limiting
///
-/// The nature of parquet forces buffering of an entire row group before it can
+/// The nature of Parquet requires buffering of an entire row group before it can
/// be flushed to the underlying writer. Data is mostly buffered in its encoded
-/// form, reducing memory usage. However, some data such as dictionary keys or
+/// form, reducing memory usage. However, some data such as dictionary keys,
/// large strings or very nested data may still result in non-trivial memory
/// usage.
///
@@ -532,6 +533,7 @@ impl ArrowColumnChunk {
/// Note: This is a low-level interface for applications that require fine-grained control
/// of encoding, see [`ArrowWriter`] for a higher-level interface
///
+/// # Example: Encoding two Arrow Arrays in Parallel
/// ```
/// // The arrow schema
/// # use std::sync::Arc;
diff --git a/parquet/src/arrow/async_reader/mod.rs b/parquet/src/arrow/async_reader/mod.rs
index fac27328e3..3f04a14090 100644
--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -15,11 +15,9 @@
// specific language governing permissions and limitations
// under the License.
-//! [`ParquetRecordBatchStreamBuilder`]: `async` API for reading Parquet files as
-//! [`RecordBatch`]es
+//! `async` API for reading Parquet files as [`RecordBatch`]es
//!
-//! This can be used to decode a Parquet file in streaming fashion (without
-//! downloading the whole file at once) from a remote source, such as an object store.
+//! See the [crate-level documentation](crate) for more details.
//!
//! See example on [`ParquetRecordBatchStreamBuilder::new`]
@@ -269,8 +267,11 @@ pub struct AsyncReader<T>(T);
/// A builder for reading parquet files from an `async` source as [`ParquetRecordBatchStream`]
///
-/// This builder handles reading the parquet file metadata, allowing consumers
-/// to use this information to select what specific columns, row groups, etc...
+/// This can be used to decode a Parquet file in streaming fashion (without
+/// downloading the whole file at once) from a remote source, such as an object store.
+///
+/// This builder handles reading the parquet file metadata, allowing consumers
+/// to use this information to select what specific columns, row groups, etc.
/// they wish to be read by the resulting stream.
///
/// See examples on [`ParquetRecordBatchStreamBuilder::new`]
diff --git a/parquet/src/arrow/async_writer/mod.rs b/parquet/src/arrow/async_writer/mod.rs
index c04d5710a9..edd4b71ae2 100644
--- a/parquet/src/arrow/async_writer/mod.rs
+++ b/parquet/src/arrow/async_writer/mod.rs
@@ -15,10 +15,12 @@
// specific language governing permissions and limitations
// under the License.
-//! Contains async writer which writes arrow data into parquet data.
+//! `async` API for writing [`RecordBatch`]es to Parquet files
//!
-//! Provides `async` API for writing [`RecordBatch`]es as parquet files. The API is
-//! similar to the [`sync` API](crate::arrow::arrow_writer::ArrowWriter), so please
+//! See the [crate-level documentation](crate) for more details.
+//!
+//! The `async` API for writing [`RecordBatch`]es is
+//! similar to the [`sync` API](ArrowWriter), so please
//! read the documentation there before using this API.
//!
//! Here is an example for using [`AsyncArrowWriter`]:
diff --git a/parquet/src/arrow/mod.rs b/parquet/src/arrow/mod.rs
index e3f09a0dce..b89c6ddcf8 100644
--- a/parquet/src/arrow/mod.rs
+++ b/parquet/src/arrow/mod.rs
@@ -15,14 +15,13 @@
// specific language governing permissions and limitations
// under the License.
-//! High-level API for reading/writing Arrow
-//! [RecordBatch](arrow_array::RecordBatch)es and
+//! API for reading/writing
+//! Arrow [RecordBatch](arrow_array::RecordBatch)es and
//! [Array](arrow_array::Array)s to/from Parquet Files.
//!
-//! [Apache Arrow](http://arrow.apache.org/) is a cross-language development platform for
-//! in-memory data.
+//! See the [crate-level documentation](crate) for more details.
//!
-//!# Example of writing Arrow record batch to Parquet file
+//! # Example of writing Arrow record batch to Parquet file
//!
//!```rust
//! # use arrow_array::{Int32Array, ArrayRef};
diff --git a/parquet/src/lib.rs b/parquet/src/lib.rs
index 86c88ff954..f814ddeb07 100644
--- a/parquet/src/lib.rs
+++ b/parquet/src/lib.rs
@@ -52,28 +52,47 @@
//! The [`schema`] module provides APIs to work with Parquet schemas. The
//! [`file::metadata`] module provides APIs to work with Parquet metadata.
//!
-//! ## Read/Write Arrow
+//! ## Reading and Writing Arrow (`arrow` feature)
//!
-//! The [`arrow`] module allows reading and writing Parquet data to/from Arrow `RecordBatch`.
-//! This makes for a simple and performant interface to parquet data, whilst allowing workloads
+//! The [`arrow`] module supports reading and writing Parquet data to/from
+//! Arrow `RecordBatch`es. Using Arrow is simple and performant, and allows workloads
//! to leverage the wide range of data transforms provided by the [arrow] crate, and by the
-//! ecosystem of libraries and services using [Arrow] as an interop format.
+//! ecosystem of [Arrow] compatible systems.
//!
-//! ## Read/Write Arrow Async
+//! Most users will use [`ArrowWriter`] for writing and [`ParquetRecordBatchReaderBuilder`] for
+//! reading.
//!
-//! When the `async` feature is enabled, [`arrow::async_reader`] and [`arrow::async_writer`]
-//! provide the ability to read and write [`arrow`] data asynchronously. Additionally, with the
-//! `object_store` feature is enabled, [`ParquetObjectReader`](arrow::async_reader::ParquetObjectReader)
+//! Lower level APIs include [`ArrowColumnWriter`] for writing using multiple
+//! threads, and [`RowFilter`] to apply filters during decode.
+//!
+//! [`ArrowWriter`]: arrow::arrow_writer::ArrowWriter
+//! [`ParquetRecordBatchReaderBuilder`]: arrow::arrow_reader::ParquetRecordBatchReaderBuilder
+//! [`ArrowColumnWriter`]: arrow::arrow_writer::ArrowColumnWriter
+//! [`RowFilter`]: arrow::arrow_reader::RowFilter
+//!
+//! ## `async` Reading and Writing Arrow (`async` feature)
+//!
+//! The [`async_reader`] and [`async_writer`] modules provide async APIs to
+//! read and write `RecordBatch`es asynchronously.
+//!
+//! Most users will use [`AsyncArrowWriter`] for writing and [`ParquetRecordBatchStreamBuilder`]
+//! for reading. When the `object_store` feature is enabled, [`ParquetObjectReader`]
//! provides efficient integration with object storage services such as S3 via the [object_store]
//! crate, automatically optimizing IO based on any predicates or projections provided.
//!
-//! ## Read/Write Parquet
+//! [`async_reader`]: arrow::async_reader
+//! [`async_writer`]: arrow::async_writer
+//! [`AsyncArrowWriter`]: arrow::async_writer::AsyncArrowWriter
+//! [`ParquetRecordBatchStreamBuilder`]: arrow::async_reader::ParquetRecordBatchStreamBuilder
+//! [`ParquetObjectReader`]: arrow::async_reader::ParquetObjectReader
+//!
+//! ## Read/Write Parquet Directly
//!
-//! Workloads needing finer-grained control, or avoid a dependence on arrow,
-//! can use the lower-level APIs in [`mod@file`]. These APIs expose the underlying parquet
-//! data model, and therefore require knowledge of the underlying parquet format,
-//! including the details of [Dremel] record shredding and [Logical Types]. Most workloads
-//! should prefer the arrow interfaces.
+//! Workloads needing finer-grained control, or to avoid a dependence on arrow,
+//! can use the APIs in [`mod@file`] directly. These APIs are harder to use
+//! as they directly use the underlying Parquet data model, and require knowledge
+//! of the Parquet format, including the details of [Dremel] record shredding
+//! and [Logical Types].
//!
//! [arrow]: https://docs.rs/arrow/latest/arrow/index.html
//! [Arrow]: https://arrow.apache.org/