tustvold commented on code in PR #5093:
URL: https://github.com/apache/arrow-rs/pull/5093#discussion_r1398263148
##########
parquet/README.md:
##########
@@ -71,14 +71,6 @@ The `parquet` crate provides the following features which may be enabled in your
- [x] Predicate pushdown
- [x] Parquet format 4.0.0 support
-## Support for `wasm32-unknown-unknown` target
Review Comment:
This is outdated and no longer true, so I took the opportunity to remove it
##########
parquet/src/arrow/async_reader/store.rs:
##########
@@ -28,42 +28,30 @@ use crate::arrow::async_reader::{AsyncFileReader, MetadataLoader};
use crate::errors::{ParquetError, Result};
use crate::file::metadata::ParquetMetaData;
-/**
-Reads Parquet files in object storage using [`ObjectStore`].
-
-Example reading a Parquet file from Azure Blob Storage
-
-```rust
-use object_store::azure::MicrosoftAzureBuilder;
-use object_store::path::Path;
-use object_store::ObjectStore;
-# use parquet::arrow::async_reader::ParquetObjectReader;
-# use parquet::arrow::ParquetRecordBatchStreamBuilder;
-# use parquet::schema::printer::print_parquet_metadata;
-# use std::error::Error;
-# use std::io::stdout;
-# use std::sync::Arc;
-
-// Open Azure Storage Blob https://myaccount.blob.core.windows.net/mycontainer/path/to/blob.parquet
-// Requires running Azure CLI `az login` beforehand to setup authentication
-let storage_container = Arc::new(
- MicrosoftAzureBuilder::new()
- .with_account("myaccount")
- .with_container_name("mycontainer")
- .with_use_azure_cli(true)
- .build()?);
-let blob = storage_container
- .get(&Path::from("path/to/blob.parquet"))
- .await?
- .meta;
-println!("Found Blob with {}B at {}", blob.size, blob.location);
-
-// Show Parquet metadata
-let reader = ParquetObjectReader::new(storage_container, blob);
-let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
-print_parquet_metadata(&mut stdout(), builder.metadata());
-```
-*/
+/// Reads Parquet files in object storage using [`ObjectStore`].
+///
+/// ```no_run
+/// # use std::io::stdout;
+/// # use std::sync::Arc;
+/// # use object_store::azure::MicrosoftAzureBuilder;
+/// # use object_store::ObjectStore;
+/// # use object_store::path::Path;
+/// # use parquet::arrow::async_reader::ParquetObjectReader;
+/// # use parquet::arrow::ParquetRecordBatchStreamBuilder;
+/// # use parquet::schema::printer::print_parquet_metadata;
+/// # async fn main() {
+/// // Populate configuration from environment
+/// let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build()?);
Review Comment:
I tweaked this to use from_env to cut down on boilerplate
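
For context, a minimal before/after sketch (the account and container names, and the `AZURE_*` variables, are illustrative placeholders, not part of this PR):

```rust
use std::sync::Arc;
use object_store::azure::MicrosoftAzureBuilder;

fn build_store() -> Result<(), Box<dyn std::error::Error>> {
    // Old example: every option spelled out on the builder
    let _explicit = Arc::new(
        MicrosoftAzureBuilder::new()
            .with_account("myaccount")
            .with_container_name("mycontainer")
            .with_use_azure_cli(true)
            .build()?,
    );

    // New example: pick the same options up from AZURE_* environment
    // variables (e.g. AZURE_STORAGE_ACCOUNT_NAME), one line instead of five
    let _from_env = Arc::new(MicrosoftAzureBuilder::from_env().build()?);
    Ok(())
}
```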
##########
parquet/src/lib.rs:
##########
@@ -15,70 +15,67 @@
// specific language governing permissions and limitations
// under the License.
-/*!
-This crate contains the official Native Rust implementation of
-[Apache Parquet](https://parquet.apache.org/), part of
-the [Apache Arrow](https://arrow.apache.org/) project.
-The crate provides a number of APIs to read and write Parquet files,
-covering a range of use cases.
-
-Please see the [parquet crates.io](https://crates.io/crates/parquet)
-page for feature flags and tips to improve performance.
-
-# Getting Started
-
-## Format Overview
-
-Parquet is a columnar format, which means that unlike row formats like
-the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values) for instance,
-values are iterated along columns instead of rows. Parquet is similar in spirit to
-[Arrow](https://arrow.apache.org/), with Parquet focusing on storage
-efficiency while Arrow prioritizes compute efficiency.
-
-Parquet files are partitioned for scalability. Each file contains metadata,
-along with zero or more "row groups", each row group containing one or
-more columns. The APIs in this crate reflect this structure.
-
-Parquet distinguishes between "logical" and "physical" data types.
-For instance, strings (logical type) are stored as byte arrays (physical type).
-Likewise, temporal types like dates, times, timestamps, etc. (logical type)
-are stored as integers (physical type). This crate exposes both kinds of types.
-
-For more details about the Parquet format, see the
-[Parquet spec](https://github.com/apache/parquet-format/blob/master/README.md#file-format).
-
-## APIs
-
-This crate exposes both low-level and high-level APIs, organized as follows:
-
-1. The [`arrow`] module reads and writes Parquet data to/from Arrow
-`RecordBatch`es. This is the recommended high-level API. It allows leveraging
-the wide range of data transforms provided by the
-[arrow](https://docs.rs/arrow/latest/arrow/index.html) crate and by the ecosystem
-of libraries and services using Arrow as a interop format.
-
-2. The [`mod@file`] module allows reading and writing Parquet files without taking a
-dependency on Arrow. This is the recommended low-level API. Parquet files are
-read and written one row group at a time by calling respectively
-[`SerializedFileReader::get_row_group`](file::serialized_reader::SerializedFileReader) and [`SerializedFileWriter::next_row_group`](file::writer::SerializedFileWriter).
-Within each row group, columns are read and written one at a time using
-respectively [`ColumnReader`](column::reader::ColumnReader) and
-[`ColumnWriter`](column::writer::ColumnWriter). The [`mod@file`] module also allows
-reading files in a row-wise manner via
-[`SerializedFileReader::get_row_iter`](file::serialized_reader::SerializedFileReader).
-This is a convenience API which favors simplicity over performance and completeness.
-It is not recommended for production use.
Review Comment:
I opted to move this into a caveats section on RowIter and expand a bit more
on the why
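
For reference, the pattern the caveats section concerns is roughly the following sketch (the file name is a placeholder, and this assumes a recent `parquet` release where the iterator yields `Result<Row>`):

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn print_rows() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder path
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;

    // Convenient, but each value is materialized into a dynamically-typed
    // Row, which is why the docs steer production code towards the columnar
    // and Arrow-based readers instead
    for row in reader.get_row_iter(None)? {
        println!("{}", row?);
    }
    Ok(())
}
```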
##########
parquet/Cargo.toml:
##########
@@ -115,11 +115,6 @@ name = "async_read_parquet"
required-features = ["arrow", "async"]
path = "./examples/async_read_parquet.rs"
-[[example]]
Review Comment:
I opted to just include the doctest as an example
##########
parquet/src/arrow/async_reader/store.rs:
##########
@@ -28,42 +28,30 @@ use crate::arrow::async_reader::{AsyncFileReader, MetadataLoader};
use crate::errors::{ParquetError, Result};
use crate::file::metadata::ParquetMetaData;
-/**
-Reads Parquet files in object storage using [`ObjectStore`].
-
-Example reading a Parquet file from Azure Blob Storage
-
-```rust
-use object_store::azure::MicrosoftAzureBuilder;
-use object_store::path::Path;
-use object_store::ObjectStore;
-# use parquet::arrow::async_reader::ParquetObjectReader;
-# use parquet::arrow::ParquetRecordBatchStreamBuilder;
-# use parquet::schema::printer::print_parquet_metadata;
-# use std::error::Error;
-# use std::io::stdout;
-# use std::sync::Arc;
-
-// Open Azure Storage Blob https://myaccount.blob.core.windows.net/mycontainer/path/to/blob.parquet
-// Requires running Azure CLI `az login` beforehand to setup authentication
-let storage_container = Arc::new(
- MicrosoftAzureBuilder::new()
- .with_account("myaccount")
- .with_container_name("mycontainer")
- .with_use_azure_cli(true)
- .build()?);
-let blob = storage_container
- .get(&Path::from("path/to/blob.parquet"))
- .await?
- .meta;
-println!("Found Blob with {}B at {}", blob.size, blob.location);
-
-// Show Parquet metadata
-let reader = ParquetObjectReader::new(storage_container, blob);
-let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
-print_parquet_metadata(&mut stdout(), builder.metadata());
-```
-*/
+/// Reads Parquet files in object storage using [`ObjectStore`].
+///
+/// ```no_run
+/// # use std::io::stdout;
+/// # use std::sync::Arc;
+/// # use object_store::azure::MicrosoftAzureBuilder;
+/// # use object_store::ObjectStore;
+/// # use object_store::path::Path;
+/// # use parquet::arrow::async_reader::ParquetObjectReader;
+/// # use parquet::arrow::ParquetRecordBatchStreamBuilder;
+/// # use parquet::schema::printer::print_parquet_metadata;
+/// # async fn main() {
+/// // Populate configuration from environment
+/// let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build()?);
+/// let location = Path::from("path/to/blob.parquet");
+/// let meta = storage_container.head(&location).await?;
Review Comment:
Head is a more efficient way to obtain the ObjectMeta, as it avoids fetching
any data
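
Roughly the difference, as a sketch (the path is a placeholder): `head` issues a metadata-only request, whereas the previous `get(...).await?.meta` pattern opened a full read of the object only to discard the body:

```rust
use object_store::{path::Path, ObjectStore};

async fn compare(store: &dyn ObjectStore) -> object_store::Result<()> {
    let location = Path::from("path/to/blob.parquet");

    // HEAD-style request: returns ObjectMeta (size, location, etc.)
    // without transferring any of the object's bytes
    let meta = store.head(&location).await?;
    println!("{} is {}B", meta.location, meta.size);

    // The old pattern: a GET whose response body is never consumed
    let _meta_via_get = store.get(&location).await?.meta;
    Ok(())
}
```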
##########
parquet/src/lib.rs:
##########
@@ -15,70 +15,67 @@
// specific language governing permissions and limitations
// under the License.
-/*!
-This crate contains the official Native Rust implementation of
-[Apache Parquet](https://parquet.apache.org/), part of
-the [Apache Arrow](https://arrow.apache.org/) project.
-The crate provides a number of APIs to read and write Parquet files,
-covering a range of use cases.
-
-Please see the [parquet crates.io](https://crates.io/crates/parquet)
-page for feature flags and tips to improve performance.
-
-# Getting Started
-
-## Format Overview
-
-Parquet is a columnar format, which means that unlike row formats like
-the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values) for instance,
-values are iterated along columns instead of rows. Parquet is similar in spirit to
-[Arrow](https://arrow.apache.org/), with Parquet focusing on storage
-efficiency while Arrow prioritizes compute efficiency.
-
-Parquet files are partitioned for scalability. Each file contains metadata,
-along with zero or more "row groups", each row group containing one or
-more columns. The APIs in this crate reflect this structure.
-
-Parquet distinguishes between "logical" and "physical" data types.
-For instance, strings (logical type) are stored as byte arrays (physical type).
-Likewise, temporal types like dates, times, timestamps, etc. (logical type)
-are stored as integers (physical type). This crate exposes both kinds of types.
-
-For more details about the Parquet format, see the
-[Parquet spec](https://github.com/apache/parquet-format/blob/master/README.md#file-format).
-
-## APIs
-
-This crate exposes both low-level and high-level APIs, organized as follows:
-
-1. The [`arrow`] module reads and writes Parquet data to/from Arrow
-`RecordBatch`es. This is the recommended high-level API. It allows leveraging
-the wide range of data transforms provided by the
-[arrow](https://docs.rs/arrow/latest/arrow/index.html) crate and by the ecosystem
-of libraries and services using Arrow as a interop format.
-
-2. The [`mod@file`] module allows reading and writing Parquet files without taking a
-dependency on Arrow. This is the recommended low-level API. Parquet files are
-read and written one row group at a time by calling respectively
-[`SerializedFileReader::get_row_group`](file::serialized_reader::SerializedFileReader) and [`SerializedFileWriter::next_row_group`](file::writer::SerializedFileWriter).
-Within each row group, columns are read and written one at a time using
-respectively [`ColumnReader`](column::reader::ColumnReader) and
-[`ColumnWriter`](column::writer::ColumnWriter). The [`mod@file`] module also allows
-reading files in a row-wise manner via
-[`SerializedFileReader::get_row_iter`](file::serialized_reader::SerializedFileReader).
-This is a convenience API which favors simplicity over performance and completeness.
-It is not recommended for production use.
-
-3. Within the [`arrow`] module, async reading and writing is provided by
-[`arrow::async_reader`] and [`arrow::async_writer`]. These APIs are more advanced and
-require the `async` feature. Within this module,
-[`ParquetObjectReader`](arrow::async_reader::ParquetObjectReader)
-enables connecting directly to the Cloud Provider storage services of AWS, Azure, GCP
-and the likes via the [object_store](https://docs.rs/object_store/latest/object_store/)
-crate, enabling network-bandwidth optimizations via predicate and projection push-downs.
-
-*/
+//!
+//! This crate contains the official Native Rust implementation of
Review Comment:
I tweaked the copy slightly to be a bit more concise and easier to parse
##########
parquet/src/file/mod.rs:
##########
@@ -15,7 +15,7 @@
// specific language governing permissions and limitations
// under the License.
-//! Main entrypoint for working with Parquet API.
+//! Low level APIs for reading raw parquet data.
Review Comment:
I did not realise this was here; it was deeply misleading :smile:
##########
parquet/src/lib.rs:
##########
@@ -15,70 +15,67 @@
// specific language governing permissions and limitations
// under the License.
-/*!
-This crate contains the official Native Rust implementation of
-[Apache Parquet](https://parquet.apache.org/), part of
-the [Apache Arrow](https://arrow.apache.org/) project.
-The crate provides a number of APIs to read and write Parquet files,
-covering a range of use cases.
-
-Please see the [parquet crates.io](https://crates.io/crates/parquet)
-page for feature flags and tips to improve performance.
-
-# Getting Started
-
-## Format Overview
-
-Parquet is a columnar format, which means that unlike row formats like
-the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values) for instance,
-values are iterated along columns instead of rows. Parquet is similar in spirit to
-[Arrow](https://arrow.apache.org/), with Parquet focusing on storage
-efficiency while Arrow prioritizes compute efficiency.
-
-Parquet files are partitioned for scalability. Each file contains metadata,
-along with zero or more "row groups", each row group containing one or
-more columns. The APIs in this crate reflect this structure.
-
-Parquet distinguishes between "logical" and "physical" data types.
-For instance, strings (logical type) are stored as byte arrays (physical type).
-Likewise, temporal types like dates, times, timestamps, etc. (logical type)
-are stored as integers (physical type). This crate exposes both kinds of types.
-
-For more details about the Parquet format, see the
-[Parquet spec](https://github.com/apache/parquet-format/blob/master/README.md#file-format).
-
-## APIs
-
-This crate exposes both low-level and high-level APIs, organized as follows:
-
-1. The [`arrow`] module reads and writes Parquet data to/from Arrow
-`RecordBatch`es. This is the recommended high-level API. It allows leveraging
-the wide range of data transforms provided by the
-[arrow](https://docs.rs/arrow/latest/arrow/index.html) crate and by the ecosystem
-of libraries and services using Arrow as a interop format.
-
-2. The [`mod@file`] module allows reading and writing Parquet files without taking a
-dependency on Arrow. This is the recommended low-level API. Parquet files are
-read and written one row group at a time by calling respectively
-[`SerializedFileReader::get_row_group`](file::serialized_reader::SerializedFileReader)
Review Comment:
I opted to de-emphasise these APIs, as they're low-level and I think most
people should just steer clear of them
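
For reference, the row-group/column flow those links describe looks roughly like this sketch (assuming, for illustration only, a placeholder file whose first column is a required, non-nested INT32):

```rust
use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn read_first_column() -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;

    // Files are read one row group at a time...
    let row_group = reader.get_row_group(0)?;

    // ...and one column at a time within each row group
    if let ColumnReader::Int32ColumnReader(mut col) = row_group.get_column_reader(0)? {
        let mut values = vec![0_i32; 1024];
        // No def/rep level buffers needed for a required, non-nested column
        let (records, values_read, _) = col.read_records(1024, None, None, &mut values)?;
        println!("read {records} records ({values_read} values)");
    }
    Ok(())
}
```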
##########
parquet/src/lib.rs:
##########
@@ -15,70 +15,67 @@
// specific language governing permissions and limitations
// under the License.
-/*!
-This crate contains the official Native Rust implementation of
-[Apache Parquet](https://parquet.apache.org/), part of
-the [Apache Arrow](https://arrow.apache.org/) project.
-The crate provides a number of APIs to read and write Parquet files,
-covering a range of use cases.
-
-Please see the [parquet crates.io](https://crates.io/crates/parquet)
-page for feature flags and tips to improve performance.
-
-# Getting Started
-
-## Format Overview
-
-Parquet is a columnar format, which means that unlike row formats like
-the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values) for instance,
-values are iterated along columns instead of rows. Parquet is similar in spirit to
-[Arrow](https://arrow.apache.org/), with Parquet focusing on storage
-efficiency while Arrow prioritizes compute efficiency.
-
-Parquet files are partitioned for scalability. Each file contains metadata,
-along with zero or more "row groups", each row group containing one or
-more columns. The APIs in this crate reflect this structure.
-
-Parquet distinguishes between "logical" and "physical" data types.
-For instance, strings (logical type) are stored as byte arrays (physical type).
-Likewise, temporal types like dates, times, timestamps, etc. (logical type)
-are stored as integers (physical type). This crate exposes both kinds of types.
-
-For more details about the Parquet format, see the
-[Parquet spec](https://github.com/apache/parquet-format/blob/master/README.md#file-format).
-
-## APIs
-
-This crate exposes both low-level and high-level APIs, organized as follows:
-
-1. The [`arrow`] module reads and writes Parquet data to/from Arrow
Review Comment:
I split these into headed sections, as they felt a bit meaty to be bullets
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]