This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch site/external_indexes
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit 14b17f6d494ddae1fceca19d126ddda5c0e3af90
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Fri Aug 8 10:25:11 2025 -0400

    updates
---
 .../blog/2025-08-15-external-parquet-indexes.md | 82 +++++++++++----------
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 9ccb242..d346285 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -30,7 +30,7 @@ metadata and is limited to indexing structures provided by the format.
 By caching parsed metadata and using custom external indexes along with the
 Parquet's hierarchical data organization to significantly speed up query
 processing.
 
-In this blog, we describe the role of external indexes, caches, and metadata
+In this blog, I describe the role of external indexes, caches, and metadata
 stores in high performance systems, and demonstrate how to apply these concepts
 to Parquet processing using [Apache DataFusion].
 *Note this is an expanded version of the [companion video] and [presentation].*
@@ -90,7 +90,7 @@ to further narrow the required data to only those **parts** of each file (e.g.
 data pages) that are relevant (Step 3). Finally, the system reads only those
 parts of the file and returns the results to the user (Step 4).
 
-In this blog, we use the term **"index"** to mean any structure that helps
+In this blog, I use the term **"index"** to mean any structure that helps
 locate relevant data during processing, and a high level overview of how
 external indexes are used to speed up queries is shown in Figure 1.
@@ -147,7 +147,7 @@ Examples of locations external indexes can be stored include:
 # Using Apache Parquet for Storage
 
 While the rest of this blog focuses on building custom external indexes using
-Parquet and DataFusion, we first briefly discuss why Parquet is a good choice
+Parquet and DataFusion, I first briefly discuss why Parquet is a good choice
 for modern analytic systems. The research community frequently confuses
 limitations of a particular [implementation of the Parquet format] with the
 [Parquet Format] itself and it is important to clarify this distinction.
@@ -426,7 +426,7 @@ pruning tasks, such as:
 * Expression simplification ([ExprSimplifier]) for simplifying predicates
   before applying them to the index.
-* Range analysis for predicates ([cp_solver]) for interval based range analysis (e.g. `col > 5 AND col < 10`)
+* Range analysis for predicates ([cp_solver]) for interval-based range analysis (e.g. `col > 5 AND col < 10`)
 
 [Expr]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
 [APIs for building, vistiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
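To make the pruning utilities above concrete, here is a minimal sketch of how an external file-level index can drive DataFusion's `PruningPredicate`, which evaluates a predicate against min/max statistics and reports which "containers" (files, in this case) can possibly match. The `MinMaxIndex` type is hypothetical, the index covers a single `Int64` column named `value`, and the import paths assume a recent DataFusion version:

```rust
use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, Int64Array};
use arrow::datatypes::SchemaRef;
use datafusion::common::{Column, Result, ScalarValue};
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};
use datafusion::physical_plan::PhysicalExpr;

/// Hypothetical external index: min/max values of the `value` column,
/// with one entry per file.
struct MinMaxIndex {
    mins: Vec<i64>,
    maxes: Vec<i64>,
}

impl PruningStatistics for MinMaxIndex {
    /// One "container" per indexed file
    fn num_containers(&self) -> usize {
        self.mins.len()
    }
    /// Minimum value of the requested column in each file
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "value")
            .then(|| Arc::new(Int64Array::from(self.mins.clone())) as ArrayRef)
    }
    /// Maximum value of the requested column in each file
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "value")
            .then(|| Arc::new(Int64Array::from(self.maxes.clone())) as ArrayRef)
    }
    // Returning None means "unknown", which never prunes incorrectly
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn contained(&self, _c: &Column, _v: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        None
    }
}

/// Given the physical form of a predicate such as `value > 5` and the file
/// schema, return one boolean per file: `false` files cannot contain any
/// matching rows and can be skipped entirely.
fn prune_files(
    predicate: Arc<dyn PhysicalExpr>,
    schema: SchemaRef,
    index: &MinMaxIndex,
) -> Result<Vec<bool>> {
    PruningPredicate::try_new(predicate, schema)?.prune(index)
}
```

Because the statistics are supplied through a trait, the same machinery works whether they come from Parquet footers, an external file, or a remote metadata store.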
@@ -436,15 +436,15 @@ pruning tasks, such as:
 
 # Pruning Parts of Parquet Files with External Indexes
 
-Once the set of files to be scanned has been determined, the next step is
-determine which parts of each file can match the query. Similarly to the
-previous step, almost all advanced query processing systems use additional
+Once the set of files to be scanned has been determined, the next step in the
+hierarchical pruning process is to further narrow down the data within each file.
+Similarly to the previous step, almost all advanced query processing systems use additional
 metadata to prune unnecessary parts of the file, such as [Data Skipping
 Indexes in ClickHouse].
 
-For Parquet based systems, the most common strategy is to use the built-in metadata such
+For Parquet-based systems, the most common strategy is using the built-in metadata such
 as [min/max statistics], and [Bloom Filters]). However, it is also possible to use external
-indexes even for filtering *WITIHIN* Parquet files as shown below.
+indexes for filtering *WITHIN* Parquet files as shown below.
 
 [Data Skipping Indexes in ClickHouse]: https://clickhouse.com/docs/optimize/skipping-indexes
 [min/max statistics]: https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L267
@@ -458,30 +458,31 @@ indexes even for filtering *WITIHIN* Parquet files as shown below.
 />
 
 **Figure 7**: Step 2: Pruning Parquet Row Groups and Data Pages. Given a query predicate,
-systems can use external indexes / metadata stores along with Parquet's built-in
-structures to quickly rule out row groups and data pages that cannot match the query.
+systems can use external indexes / metadata stores as well as Parquet's built-in
+structures to quickly rule out Row Groups and Data Pages that cannot match the query.
 In this case, the index has ruled out all but three data pages which must then be
 fetched for more processing.
 
 # Pruning Parts of Parquet Files with External Indexes using DataFusion
 
-To implement pruning within Parquet files, you provide a [ParquetAccessPlan] for
-each file that tells DataFusion what parts of the file to read. This plan is
+To implement pruning within Parquet files, you use the same [`TableProvider`] APIs
+as for pruning files. For each file your provider wants to scan, you provide
+an additional [ParquetAccessPlan] that tells DataFusion what parts of the file to read. This plan is
 then [further refined by the DataFusion Parquet reader] using the built-in
 Parquet metadata to potentially prune additional row groups and data pages
-during query execution. You can find a full working example of this technique in
-the [advanced_parquet_index.rs] example.
+during query execution. You can find a full working example in
+the [advanced_parquet_index.rs] example in the DataFusion repository.
 
 [ParquetAccessPlan]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
 [further refined by the DataFusion Parquet reader]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html#implementing-external-indexes
 
 Here is how you can build a `ParquetAccessPlan` to scan only specific row groups
 and rows within those row groups.
 
 ```rust
-// Default to scan all row groups
+// Default to scan all (4) row groups
 let mut access_plan = ParquetAccessPlan::new_all(4);
-access_plan.skip(0); // skip row group
+access_plan.skip(0); // skip row group 0
 // Specify scanning rows 100-200 and 350-400
 // in row group 1 that has 1000 rows
 let row_selection = RowSelection::from(vec![
@@ -496,7 +497,7 @@ access_plan.skip(2); // skip row group 2
 // all of row group 3 is scanned by default
 ```
 
-The resulting plan looks like this:
+The rows that are selected by the resulting plan look like this:
 
 ```text
 ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
@@ -526,10 +527,9 @@
 Row Group 2           Row Group 3
 ```
 
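For completeness, here is a sketch of the full plan construction, assuming row group 1 holds 1000 rows. The `scan_selection` call, which attaches a `RowSelection` to a specific row group, follows the API shown in the [advanced_parquet_index.rs] example; the exact selector values are illustrative:

```rust
use datafusion::datasource::physical_plan::parquet::ParquetAccessPlan;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

// Default to scanning all 4 row groups
let mut access_plan = ParquetAccessPlan::new_all(4);
access_plan.skip(0); // skip row group 0

// Select rows 100-200 and 350-400 within row group 1 (1000 rows total)
let row_selection = RowSelection::from(vec![
    RowSelector::skip(100),   // rows 0..100
    RowSelector::select(100), // rows 100..200
    RowSelector::skip(150),   // rows 200..350
    RowSelector::select(50),  // rows 350..400
    RowSelector::skip(600),   // rows 400..1000
]);
access_plan.scan_selection(1, row_selection);

access_plan.skip(2); // skip row group 2
// all of row group 3 is scanned by default
```

Note that the skip/select lengths within a row group must add up to its total row count, which is why the trailing 600 rows are skipped explicitly.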
-You provide the `ParquetAccessPlan` via a `TableProvider`, similarly to the
-previous section. In the `scan` method, you can return an `ExecutionPlan` that
-includes the `ParquetAccessPlan` for each file as follows (again, slightly
-simplified for clarity):
+In the `scan` method, you return an `ExecutionPlan` that includes the
+`ParquetAccessPlan` for each file as shown below (again, slightly simplified for
+clarity):
 
 ```rust
 impl TableProvider for IndexTableProvider {
@@ -581,10 +581,9 @@ impl TableProvider for IndexTableProvider {
 
 # Caching Parquet Metadata
 
 It is often said that Parquet is unsuitable for low latency query systems
-because the footer must be read and parsed for each query, which is simply not
-true. Low latency analytic systems are always stateful in practice, with multiple caching layers.
-These systems optimize for low latency by caching the parsed metadata in memory, so there
-is no need to re-read and re-parse the footer for each query.
+because the footer must be read and parsed for each query. This is simply not
+true, and **many systems use Parquet for low latency analytics and cache the parsed
+metadata in memory to avoid re-reading and re-parsing the footer for each query**.
 
 ## Caching Parquet Metadata using DataFusion
 
@@ -593,14 +592,17 @@
 example. The example reads and caches the metadata for each file when the index
 is first built and then uses the cached metadata when reading the files during
 query execution.
 
-( Note that thanks to [Nuno Faria], the built in [ListingTable] will cache
-parsed metadata in the next release of DataFusion (50.0.0). See the [mini epic]
-for details).
+(Note that thanks to [Nuno Faria], [Jonathan Chen], and [Shehab Amin], the built
+in [ListingTable] `TableProvider` included with DataFusion will cache Parquet
+metadata in the next release of DataFusion (50.0.0). See the [mini epic] for
+details.)
 
 [advanced_parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
 [ListingTable]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 [mini epic]: https://github.com/apache/datafusion/issues/17000
 [Nuno Faria]: https://nuno-faria.github.io/
+[Jonathan Chen]: https://github.com/jonathanc-n
+[Shehab Amin]: https://github.com/shehabgamin
 
 To avoid reparsing the metadata, first implement a custom
 [ParquetFileReaderFactory] as shown below, again slightly simplified for
@@ -644,7 +646,7 @@ impl ParquetFileReaderFactory for CachedParquetFileReaderFactory {
 }
 ```
 
-Then, in your [`TableProvider`] use the factory to avoid re-reading the metadata
+Then, in your [TableProvider], use the factory to avoid re-reading the metadata
 for each file:
 
 ```rust
 impl TableProvider for IndexTableProvider {
@@ -686,18 +688,18 @@ impl TableProvider for IndexTableProvider {
 }
 ```
 
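Conceptually, the factory consults a shared cache of parsed footers that is filled when the index is built and read at query time. Here is a minimal sketch of such a cache; the `FooterCache` type and its methods are hypothetical, not part of the DataFusion API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use parquet::file::metadata::ParquetMetaData;

/// Hypothetical in-memory cache of parsed Parquet footers, keyed by file
/// path. A production version might bound its size or evict cold entries.
#[derive(Default)]
struct FooterCache {
    entries: Mutex<HashMap<String, Arc<ParquetMetaData>>>,
}

impl FooterCache {
    /// Store a parsed footer when the external index is first built
    fn insert(&self, path: &str, metadata: Arc<ParquetMetaData>) {
        self.entries.lock().unwrap().insert(path.to_string(), metadata);
    }

    /// Look up a parsed footer at query time; the custom
    /// `ParquetFileReaderFactory` returns this instead of re-reading and
    /// re-parsing the footer bytes from object storage
    fn get(&self, path: &str) -> Option<Arc<ParquetMetaData>> {
        self.entries.lock().unwrap().get(path).cloned()
    }
}
```

Sharing the metadata via `Arc` keeps lookups cheap: each query clones a pointer rather than re-decoding a footer that can run to hundreds of kilobytes for wide tables.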
-
 # Conclusion
 
-Parquet has the right structure for high performance analytics and it is
-straightforward to build external indexes to speed up queries using DataFusion
-without changing the file format.
+Parquet has the right structure for high performance analytics via hierarchical
+pruning, and it is straightforward to build external indexes to speed up queries
+using DataFusion without changing the file format. If you need to build a custom
+data platform, it has never been easier to build it with Parquet and DataFusion.
 
-I am a firm believer that data systems of the future will built on a foundation
-of modular, high quality, open source components such as Parquet, Arrow and
-DataFusion. and we should focus our efforts as a community on improving these
-components rather than building new file formats that are optimized for
-narrow use cases.
+I am a firm believer that data systems of the future will be built on a
+foundation of modular, high quality, open source components such as Parquet,
+Arrow and DataFusion, and we should focus our efforts as a community on
+improving these components rather than building new file formats that are
+optimized for narrow use cases.
 
 Come Join Us! 🎣