crepererum commented on code in PR #10647:
URL: https://github.com/apache/datafusion/pull/10647#discussion_r1613322378
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.

Review Comment:
   I would call this "concurrency" instead of "multi-threading". IIRC we don't implement ANY threading in this operator and solely rely on tokio to dispatch concurrent bits for us. I think it's fine to mention that the concurrency in this operator CAN lead to multi-core usage under specific circumstances.

##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See

Review Comment:
   ```suggestion
   /// * Custom readers: implements I/O for accessing byte ranges and the metadata object. See
   ```
   It's not steering the IO process, it's actually responsible for performing (or not performing) it. For example, a custom impl. could totally NOT use an object store (which is esp. interesting for the metadata bit, see other comment below).
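   To make the "performing (or not performing) it" point concrete, here is a minimal, untested sketch of such a reader: byte-range I/O still goes through an inner object-store-backed reader, but the footer is never fetched. The `CachedMetadataReader` name is made up, and the signatures assume the `parquet` crate's `AsyncFileReader` trait as of this PR:

   ```rust
   use std::{ops::Range, sync::Arc};

   use bytes::Bytes;
   use futures::future::BoxFuture;
   use parquet::arrow::async_reader::AsyncFileReader;
   use parquet::errors::Result;
   use parquet::file::metadata::ParquetMetaData;

   /// Hypothetical reader: page data still comes from an inner reader
   /// (e.g. one backed by an `ObjectStore`), but `get_metadata` answers
   /// from a pre-loaded copy and never touches storage.
   struct CachedMetadataReader {
       inner: Box<dyn AsyncFileReader + Send>,
       metadata: Arc<ParquetMetaData>,
   }

   impl AsyncFileReader for CachedMetadataReader {
       fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
           // Byte ranges (data pages, page index, ...) are fetched by
           // the inner reader.
           self.inner.get_bytes(range)
       }

       fn get_byte_ranges(
           &mut self,
           ranges: Vec<Range<usize>>,
       ) -> BoxFuture<'_, Result<Vec<Bytes>>> {
           // Preserve whatever request coalescing the inner reader does.
           self.inner.get_byte_ranges(ranges)
       }

       fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
           // No footer read at all: the metadata was obtained elsewhere
           // (extra service, dataset-based source, cache, ...).
           let metadata = Arc::clone(&self.metadata);
           Box::pin(async move { Ok(metadata) })
       }
   }
   ```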
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See
+/// [`ParquetFileReaderFactory`] for more details.
+///
+/// * Schema adapters: read parquet files with different schemas into a unified
+/// table schema. This can be used to implement "schema evolution". See
+/// [`SchemaAdapterFactory`] for more details.
+///
+/// * metadata_size_hint: controls the number of bytes read from the end of the

Review Comment:
   FWIW this is passed on to the reader (custom or built-in), and the reader uses it to gather the metadata. The reader CAN, however, use another, more precise source for this information, or not read the metadata from the object store at all (e.g. it could use an extra service, a dataset-based source, or some sort of cache).

##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -642,11 +714,22 @@ fn should_enable_page_index(
         .unwrap_or(false)
 }
 
-/// Factory of parquet file readers.
+/// Interface for creating [`AsyncFileReader`]s to read parquet files.
+///
+/// This interface is used by [`ParquetOpener`] in order to create readers for
+/// parquet files. Implementations of this trait can be used to provide custom

Review Comment:
   What's "this trait" in this case? I guess you're referring to `AsyncFileReader`, not `ParquetFileReaderFactory`, here. To avoid confusion and to give the user more freedom in how/where they implement "pre-cached data, I/O ..." etc., I suggest starting a new paragraph and saying:

   ```markdown
   The combined implementations of [`ParquetFileReaderFactory`] and [`AsyncFileReader`] can be used to provide custom data access operations such as pre-cached data, I/O coalescing, etc.
   ```
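   As a rough illustration of that "combined implementations" idea (again untested; import paths and the exact `create_reader` signature are assumed from the DataFusion version under review, and `CachedMetadataReaderFactory` is a made-up name), a factory could look up pre-fetched metadata by file path and wrap the reader sketched in the earlier comment:

   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   use datafusion::datasource::physical_plan::{FileMeta, ParquetFileReaderFactory};
   use datafusion::error::Result;
   use datafusion::physical_plan::metrics::ExecutionPlanMetricsSet;
   use object_store::path::Path;
   use parquet::arrow::async_reader::AsyncFileReader;
   use parquet::file::metadata::ParquetMetaData;

   /// Hypothetical factory: consults a pre-populated map of metadata
   /// keyed by file path; on a hit it wraps the inner factory's reader
   /// with the `CachedMetadataReader` sketched above, on a miss it
   /// returns the inner reader unchanged (which will read the footer).
   #[derive(Debug)]
   struct CachedMetadataReaderFactory {
       metadata: HashMap<Path, Arc<ParquetMetaData>>,
       inner: Arc<dyn ParquetFileReaderFactory>,
   }

   impl ParquetFileReaderFactory for CachedMetadataReaderFactory {
       fn create_reader(
           &self,
           partition_index: usize,
           file_meta: FileMeta,
           metadata_size_hint: Option<usize>,
           metrics: &ExecutionPlanMetricsSet,
       ) -> Result<Box<dyn AsyncFileReader + Send>> {
           let cached = self.metadata.get(file_meta.location()).cloned();
           // The hint is only advice for readers that actually fetch the
           // footer, so it is forwarded untouched.
           let inner = self.inner.create_reader(
               partition_index,
               file_meta,
               metadata_size_hint,
               metrics,
           )?;
           match cached {
               Some(metadata) => Ok(Box::new(CachedMetadataReader { inner, metadata })),
               None => Ok(inner),
           }
       }
   }
   ```

   Note how `metadata_size_hint` passes straight through: with cached metadata it is simply never consulted, which is exactly the "more precise source" situation described above.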
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See
+/// [`ParquetFileReaderFactory`] for more details.
+///
+/// * Schema adapters: read parquet files with different schemas into a unified
+/// table schema. This can be used to implement "schema evolution". See
+/// [`SchemaAdapterFactory`] for more details.
+///
+/// * metadata_size_hint: controls the number of bytes read from the end of the
+/// file in the initial I/O.
+///
+/// # Execution Overview
+///
+/// * Step 1: [`ParquetExec::execute`] is called, returning a [`FileStream`]
+/// configured to open parquet files with a [`ParquetOpener`].
+///
+/// * Step 2: When the stream is polled, the [`ParquetOpener`] is called to open
+/// the file.
+///
+/// * Step 3: The `ParquetOpener` gets the file metadata by reading the footer,
+/// and applies any predicates and projections to determine what pages must be
+/// read.

Review Comment:
   It gets the metadata from the `ParquetFileReaderFactory`, or, more specifically, the `AsyncFileReader` that this factory returns. The `ParquetOpener` doesn't care where the metadata comes from.
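   For reference, a simplified, hypothetical sketch of the Step 2/3 flow (this is not the actual `ParquetOpener` code; the `open_with_factory` helper and its error handling are invented for illustration, and the builder API is assumed from the `parquet` crate of this era):

   ```rust
   use datafusion::datasource::physical_plan::{FileMeta, ParquetFileReaderFactory};
   use datafusion::error::Result;
   use datafusion::physical_plan::metrics::ExecutionPlanMetricsSet;
   use parquet::arrow::async_reader::AsyncFileReader;
   use parquet::arrow::ParquetRecordBatchStreamBuilder;

   /// Invented helper that mirrors, in very simplified form, what the
   /// opener does: ask the factory for a reader, then let the stream
   /// builder call `get_metadata()` on whatever came back.
   async fn open_with_factory(
       factory: &dyn ParquetFileReaderFactory,
       file_meta: FileMeta,
       metadata_size_hint: Option<usize>,
       metrics: &ExecutionPlanMetricsSet,
   ) -> Result<()> {
       // Step 2: create the reader via the factory (partition 0 for the
       // sake of the example).
       let reader: Box<dyn AsyncFileReader> =
           factory.create_reader(0, file_meta, metadata_size_hint, metrics)?;

       // Step 3: the builder fetches the metadata through the reader;
       // the opener never knows (or cares) where it came from.
       let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
       println!("row groups: {}", builder.metadata().num_row_groups());

       // Predicates and projections would now be applied against that
       // metadata to decide which row groups and pages to read.
       Ok(())
   }
   ```

   The design point is that the opener only ever talks to the `AsyncFileReader` abstraction, so any of the cached/custom readers above plug in without the opener changing at all.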