(datafusion) branch main updated: Clarify documentation about gathering statistics for parquet files (#16157)

xudong963 Wed, 28 May 2025 09:24:23 -0700

This is an automated email from the ASF dual-hosted git repository.

xudong963 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new 33a2531f47 Clarify documentation about gathering statistics for parquet files (#16157)
33a2531f47 is described below

commit 33a2531f470df26b652fbee394a6dc74da55ff89
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Wed May 28 12:24:14 2025 -0400

    Clarify documentation about gathering statistics for parquet files (#16157)
    
    * Improve CREATE EXTERNAL TABLE documentation, add note about statistics
    
    * Add comments to SessionContext::read_parquet and register_parquet
    
    * fmt
    
    * Apply suggestions from code review
    
    Co-authored-by: Oleks V <comph...@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Oleks V <comph...@users.noreply.github.com>
---
 datafusion/core/src/execution/context/parquet.rs | 23 +++++++++++++++++
 docs/source/user-guide/sql/ddl.md                | 33 ++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/datafusion/core/src/execution/context/parquet.rs b/datafusion/core/src/execution/context/parquet.rs
index 23ac6b8848..eea2b80477 100644
--- a/datafusion/core/src/execution/context/parquet.rs
+++ b/datafusion/core/src/execution/context/parquet.rs
@@ -31,6 +31,22 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`super::ListingTable`].
     ///
     /// For an example, see [`read_csv`](Self::read_csv)
+    ///
+    /// # Note: Statistics
+    ///
+    /// NOTE: by default, statistics are not collected when reading the Parquet
+    /// files as this can slow down the initial DataFrame creation. However,
+    /// collecting statistics can greatly accelerate queries with certain
+    /// filters.
+    ///
+    /// To enable statistics collection, set the [config option]
+    /// `datafusion.execution.collect_statistics` to `true`. See
+    /// [`ConfigOptions`] and [`ExecutionOptions::collect_statistics`] for more
+    /// details.
+    ///
+    /// [config option]: https://datafusion.apache.org/user-guide/configs.html
+    /// [`ConfigOptions`]: crate::config::ConfigOptions
+    /// [`ExecutionOptions::collect_statistics`]: crate::config::ExecutionOptions::collect_statistics
     pub async fn read_parquet<P: DataFilePaths>(
         &self,
         table_paths: P,
@@ -41,6 +57,13 @@ impl SessionContext {
 
     /// Registers a Parquet file as a table that can be referenced from SQL
     /// statements executed against this context.
+    ///
+    /// # Note: Statistics
+    ///
+    /// Statistics are not collected by default. See  [`read_parquet`] for more
+    /// details and how to enable them.
+    ///
+    /// [`read_parquet`]: Self::read_parquet
     pub async fn register_parquet(
         &self,
         table_ref: impl Into<TableReference>,
diff --git a/docs/source/user-guide/sql/ddl.md b/docs/source/user-guide/sql/ddl.md
index fc18154bec..ff8fa9bac0 100644
--- a/docs/source/user-guide/sql/ddl.md
+++ b/docs/source/user-guide/sql/ddl.md
@@ -82,6 +82,8 @@ For a comprehensive list of format-specific options that can be specified in the
 a path to a file or directory of partitioned files locally or on an
 object store.
 
+### Example: Parquet
+
 Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement such as the following. It is not necessary to
 provide schema information for Parquet files.
 
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.execution.collect_statistics`
+configuration option to `true` before creating the table. For example:
+
+```sql
+SET datafusion.execution.collect_statistics = true;
+```
+
+See the [config settings docs](../configs.md) for more details.
+:::
+
+### Example: Comma Separated Value (CSV)
+
 CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema will be inferred based on
 scanning a subset of the file.
 
@@ -101,6 +120,8 @@ LOCATION '/path/to/aggregate_simple.csv'
 OPTIONS ('has_header' 'true');
 ```
 
+### Example: Compression
+
 It is also possible to use compressed files, such as `.csv.gz`:
 
 ```sql
@@ -111,6 +132,8 @@ LOCATION '/path/to/aggregate_simple.csv.gz'
 OPTIONS ('has_header' 'true');
 ```
 
+### Example: Specifying Schema
+
 It is also possible to specify the schema manually.
 
 ```sql
@@ -134,6 +157,8 @@ LOCATION '/path/to/aggregate_test_100.csv'
 OPTIONS ('has_header' 'true');
 ```
 
+### Example: Partitioned Tables
+
 It is also possible to specify a directory that contains a partitioned
 table (multiple files with the same schema)
 
@@ -144,7 +169,9 @@ LOCATION '/path/to/directory/of/files'
 OPTIONS ('has_header' 'true');
 ```
 
-With `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement. We can create unbounded data sources such as following:
+### Example: Unbounded Data Sources
+
+We can create unbounded data sources using the `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement.
 
 ```sql
 CREATE UNBOUNDED EXTERNAL TABLE taxi
@@ -154,6 +181,8 @@ LOCATION '/mnt/nyctaxi/tripdata.parquet';
 
 Note that this statement actually reads data from a fixed-size file, so a better example would involve reading from a FIFO file. Nevertheless, once Datafusion sees the `UNBOUNDED` keyword in a data source, it tries to execute queries that refer to this unbounded source in streaming fashion. If this is not possible according to query specifications, plan generation fails stating it is not possible to execute given query in streaming fashion. Note that queries that can run with unbounded s [...]
 
+### Example: `WITH ORDER` Clause
+
 When creating an output from a data source that is already ordered by
 an expression, you can pre-specify the order of the data using the
 `WITH ORDER` clause. This applies even if the expression used for
@@ -190,7 +219,7 @@ WITH ORDER (sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
          [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...])
 ```
 
-### Cautions when using the WITH ORDER Clause
+#### Cautions when using the WITH ORDER Clause
 
 - It's important to understand that using the `WITH ORDER` clause in the `CREATE EXTERNAL TABLE` statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct.
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org
