(datafusion-site) branch main updated: [BLOG] Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet (#99)

alamb Fri, 15 Aug 2025 03:44:23 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/main by this push:
     new b23bd7a  [BLOG] Using External Indexes, Metadata Stores, Catalogs and 
Caches to Accelerate Queries on Apache Parquet (#99)
b23bd7a is described below

commit b23bd7a26e1df3e2b08cdb03a86e42fe267c2fd6
Author: Andrew Lamb <[email protected]>
AuthorDate: Fri Aug 15 03:44:04 2025 -0700

    [BLOG] Using External Indexes, Metadata Stores, Catalogs and Caches to 
Accelerate Queries on Apache Parquet (#99)
    
    * [BLOG] Using External Indexes and Metadata Stores to Accelerate Queries 
on Apache Parquet
    
    * Apply suggestions from code review
    
    * Fix headings
    
    * Hone background section
    
    * fix link
    
    * Update content/blog/2025-08-15-external-parquet-indexes.md
    
    Co-authored-by: Adam Reeve <[email protected]>
    
    * strengthen first paragraph of background
    
    * wordsmith
    
    * grammar
    
    * improved ascii art
    
    * more grammar
    
    * paren
    
    * pages --> files in diagram
    
    * grammar
    
    * Implement @comphead's suggestion: highlighted the benefit of external 
indexes comparing to file level indexes in terms of IO
    
    * honing
    
    * capitalization OCD
    
    * Add acnknoeldgements section
    
    ---------
    
    Co-authored-by: Adam Reeve <[email protected]>
---
 .../blog/2025-08-15-external-parquet-indexes.md    | 802 +++++++++++++++++++++
 .../external-index-overview.png                    | Bin 0 -> 244699 bytes
 .../parquet-filter-pushdown.png                    | Bin 0 -> 236537 bytes
 .../external-parquet-indexes/parquet-layout.png    | Bin 0 -> 244089 bytes
 .../external-parquet-indexes/parquet-metadata.png  | Bin 0 -> 302935 bytes
 .../processing-pipeline.png                        | Bin 0 -> 95998 bytes
 .../external-parquet-indexes/prune-files.png       | Bin 0 -> 344697 bytes
 .../external-parquet-indexes/prune-row-groups.png  | Bin 0 -> 221498 bytes
 8 files changed, 802 insertions(+)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md 
b/content/blog/2025-08-15-external-parquet-indexes.md
new file mode 100644
index 0000000..62d4cfe
--- /dev/null
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -0,0 +1,802 @@
+---
+layout: post
+title: Using External Indexes, Metadata Stores, Catalogs and Caches to 
Accelerate Queries on Apache Parquet
+date: 2025-08-15
+author: Andrew Lamb (InfluxData)
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- diagrams source 
https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q
 --> 
+
+It is a common misconception that [Apache Parquet] requires (slow) reparsing of
+metadata and is limited to indexing structures provided by the format. In fact,
+caching parsed metadata and using custom external indexes along with
+Parquet's hierarchical data organization can significantly speed up query
+processing.
+
+In this blog, I describe the role of external indexes, caches, and metadata
+stores in high performance systems, and demonstrate how to apply these concepts
+to Parquet processing using [Apache DataFusion]. *Note this is an expanded
+version of the [companion video] and [presentation].*
+
+## Motivation
+
+System designers choose between a pre-configured data system or the often
+daunting task of building their own custom data platform from scratch.
+For many users and use cases, one of the existing data systems will
+likely be good enough. However, traditional systems such as [Apache Spark], 
[DuckDB],
+[ClickHouse], [Hive], or [Snowflake] are each optimized for a certain set of
+tradeoffs between performance, cost, availability, interoperability, deployment
+target, cloud / on-premises, operational ease and many other factors.
+
+For new, or especially demanding use cases, where no existing system makes your
+optimal tradeoffs, you can build your own custom data platform. Previously this
+was a long and expensive endeavor, but today, in the era of [Composable Data
+Systems], it is increasingly feasible. High quality, open source building 
blocks
+such as [Apache Parquet] for storage, [Apache Arrow] for in-memory processing,
+and [Apache DataFusion] for query execution make it possible to quickly build
+custom data platforms optimized for your specific
+needs<sup>[1](#footnote1)</sup>.
+
+
+
+[companion video]: https://www.youtube.com/watch?v=74YsJT1-Rdk
+[presentation]: 
https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q/edit
+
+[Apache Parquet]: https://parquet.apache.org/
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Arrow]: https://arrow.apache.org/
+[FDAP Stack]: 
https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
+[Composable Data Systems]: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
+[Apache Spark]: https://spark.apache.org/
+[ClickHouse]: https://clickhouse.com/
+[Hive]: https://hive.apache.org/
+[Snowflake]: https://www.snowflake.com/
+
+
+## Introduction to External Indexes / Catalogs / Metadata Stores / Caches
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/external-index-overview.png"
+  width="80%"
+  class="img-responsive"
+  alt="Using External Indexes to Accelerate Queries"
+/>
+</div>
+
+**Figure 1**: Using external indexes to speed up queries in an analytic system.
+Given a user's query (Step 1), the system uses an external index (one that is 
not
+stored inline in the data files) to quickly find files that may contain
+relevant data (Step 2). Then, for each file, the system uses the external index
+to further narrow the required data to only those **parts** of each file
+(e.g. data pages) that are relevant (Step 3). Finally, the system reads only 
those
+parts of the file and returns the results to the user (Step 4).
+
+In this blog, I use the term **"index"** to mean any structure that helps
+locate relevant data during processing, and a high level overview of how
+external indexes are used to speed up queries is shown in Figure 1.
+
+All data systems typically store both the data itself and additional 
information
+(metadata) to more quickly find data relevant to a query. Metadata is often
+stored in structures with names like "index," "catalog" and "cache" and the
+terminology varies widely across systems. 
+
+There are many different types of indexes, types of content stored in indexes,
+strategies to keep indexes up to date, and ways to apply indexes during query
+processing. These differences each have their own set of tradeoffs, and thus
+different systems understandably make different choices depending on their use
+case. There is no one-size-fits-all solution for indexing. For example, Hive
+uses the [Hive Metastore], [Vertica] uses a purpose-built [Catalog], and open
+data lake systems typically use a table format such as [Apache Iceberg] or 
[Delta
+Lake].
+
+**External Indexes** store information separately ("external") to the data
+itself. External indexes are flexible and widely used, but require additional
+operational overhead to keep in sync with the data files. For example, if you
+add a new Parquet file to your data lake, you must also update the relevant
+external index to include information about the new file. Note, you can
+avoid the operational overhead of external indexes by using only the data files
+themselves, including [Embedding User-Defined Indexes in Apache Parquet
+Files]. However, this approach comes with its own set of tradeoffs such as 
+increased file sizes and the need to update the data files to update the index.
+
+Examples of information commonly stored in external indexes include:
+
+* Min/Max statistics
+* Bloom filters
+* Inverted indexes / Full Text indexes 
+* Information needed to read the remote file (e.g the schema, or Parquet 
footer metadata)
+
+Examples of locations where external indexes can be stored include:
+
+* **Separate files** such as [JSON] or Parquet files.
+* **Transactional databases** such as [PostgreSQL] tables.
+* **Distributed key-value stores** such as [Redis] or [Cassandra].
+* **Local memory** such as an in-memory hash map.
+
+[Hive Metastore]: 
https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore
+[Catalog]: 
https://www.vertica.com/docs/latest/HTML/Content/Authoring/AdministratorsGuide/Managing/Metadata/CatalogOverview.htm
+[Apache Iceberg]: https://iceberg.apache.org/
+[Delta Lake]: https://delta.io/
+[Embedding User-Defined Indexes in Apache Parquet Files]: 
https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
+[JSON]: https://www.json.org/
+[PostgreSQL]: https://www.postgresql.org/
+[Vertica]: https://www.vertica.com/
+[Redis]: https://redis.io/
+[Cassandra]: https://cassandra.apache.org/
+
+## Using Apache Parquet for Storage
+
+While the rest of this blog focuses on building custom external indexes using
+Parquet and DataFusion, I first briefly discuss why Parquet is a good choice 
for
+modern analytic systems. The research community frequently confuses limitations
+of a particular [implementation of the Parquet format] with the [Parquet 
Format]
+itself, and this confusion often obscures capabilities that make Parquet a good
+target for external indexes.
+
+[implementation of the Parquet format]: 
https://parquet.apache.org/docs/file-format/implementationstatus/
+[Parquet Format]: https://parquet.apache.org/docs/file-format/
+
+Apache Parquet's combination of good compression, high-performance, high 
quality
+open source libraries, and wide ecosystem interoperability make it a compelling
+choice when building new systems. While there are some niche use cases that may
+benefit from specialized formats, Parquet is typically the obvious choice.
+While recent proprietary file formats differ in details, they all use the same
+high level structure<sup>[2](#footnote2)</sup> as Parquet: 
+
+1. Metadata (typically at the end  of the file)
+2. Data divided into columns and then into horizontal slices (e.g. Parquet Row 
Groups and/or Data Pages). 
+
+The structure is so widespread because it enables the hierarchical pruning
+approach described in the next section. For example, the native [Clickhouse
+MergeTree] format consists of *Parts* (similar to Parquet files), and 
*Granules*
+(similar to Row Groups). The [Clickhouse indexing strategy] follows a classic
+hierarchical pruning approach that first locates the Parts and then the 
Granules
+that may contain relevant data for the query. This is exactly the same pattern
+as Parquet based systems, which first locate the relevant Parquet files and 
then
+the Row Groups / Data Pages within those files.
+
+[Clickhouse MergeTree]: 
https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree
+[Clickhouse indexing strategy]: 
https://clickhouse.com/docs/guides/best-practices/sparse-primary-indexes#clickhouse-index-design
+[Parquet Format]: https://parquet.apache.org/docs/file-format/
+
+A common criticism of using Parquet is that it is not as performant as some new
+proposal. These criticisms typically cherry-pick a few queries and/or datasets
+and build a specialized index or data layout for that specific case. However,
+as I explain in the [companion video] of this blog, even for
+[ClickBench]<sup>[6](#footnote6)</sup>, the current
+benchmaxxing<sup>[3](#footnote3)</sup> target of analytics vendors, there is
+less than a factor of two difference in performance between custom file formats
+and Parquet. The difference becomes even lower when using Parquet files that
+use the full range of existing Parquet features such Column and Offset
+Indexes and Bloom Filters<sup>[7](#footnote7)</sup>. Compared to the low
+interoperability and expensive transcoding/loading step of alternate file
+formats, Parquet is hard to beat.
+
+## Hierarchical Pruning Overview
+
+The key technique for optimizing query processing systems is skipping as
+much data as possible, as quickly as possible. Analytic systems typically use 
a hierarchical
+approach to progressively narrow the set of data that needs to be processed. 
+The standard approach is shown in Figure 2:
+
+1. Entire files are ruled out
+2. Within each file, large sections (e.g. Row Groups) are ruled out
+3. (Optionally) smaller sections (e.g. Data Pages) are ruled out
+4. Finally, the system reads only the relevant data pages and applies the query
+   predicate to the data
+
+<div class="text-center">
+<img 
+  src="/blog/images/external-parquet-indexes/processing-pipeline.png" 
+  width="80%" 
+  class="img-responsive" 
+  alt="Standard Pruning Layers."
+/>
+</div>
+
+**Figure 2**: Hierarchical Pruning: The system first rules out files, then
+Row Groups, then Data Pages, and finally reads only the relevant data pages.
+
+The process is hierarchical because the per-row computation required at the
+earlier stages (e.g. skipping an entire file) is lower than the computation
+required at later stages (apply predicates to the data). 
+As mentioned before, while the details of what metadata is used and how that
+metadata is managed varies substantially across query systems, they almost all
+use a hierarchical pruning strategy.
+
+
+[DuckDB]: https://duckdb.org/
+[Vortex]: https://docs.vortex.dev/
+[ClickBench]: https://clickbench.com/
+[companion video]: https://www.youtube.com/watch?v=74YsJT1-Rdk
+
+## Apache Parquet Overview
+
+This section provides a brief background on the organization of Apache Parquet
+files which is needed to fully understand the sections on implementing 
external indexes.
+If you are already familiar with Parquet, you can skip this section.
+
+Logically, Parquet files are organized into  *Row Groups* and *Column Chunks* 
as
+shown below.
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/parquet-layout.png"
+  width="80%"
+  class="img-responsive"
+  alt="Logical Parquet File layout: Row Groups and Column Chunks."
+/>
+</div>
+
+**Figure 3**: Logical Parquet File Layout: Data is first divided in horizontal 
slices
+called Row Groups. The data is then stored column by column in *Column Chunks*.
+This arrangement allows efficient access to only the portions of columns needed
+for a query.
+
+Physically, Parquet data is stored as a series of Data Pages along with 
metadata
+stored at the end of the file (in the footer), as shown below.
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/parquet-metadata.png"
+  width="80%"
+  class="img-responsive"
+  alt="Physical Parquet File layout: Metadata and Footer."
+/>
+</div>
+
+**Figure 4**: Physical Parquet File Layout: A typical Parquet file is composed
+of many data pages,  which contain the raw encoded data, and a footer that
+stores metadata about the file, including the schema and the location of the
+relevant data pages, and optional statistics such as min/max values for each
+Column Chunk.
+
+Parquet files are organized to minimize IO and processing using two key 
mechanisms:
+
+1. **Projection Pushdown**: if a query needs only a subset of columns from a 
table, it
+   only needs to read the pages for the relevant Column Chunks
+
+2. **Filter Pushdown**: Similarly, given a query with a filter predicate such 
as
+   `WHERE C > 25`, query engines can use statistics such as (but not limited 
to)
+   the min/max values stored in the metadata to skip reading and decoding 
pages that
+   cannot possibly match the predicate.
+
+The high level mechanics of Parquet predicate pushdown is shown below:
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/parquet-filter-pushdown.png"
+  width="80%"
+  class="img-responsive"
+  alt="Parquet Filter Pushdown: use filter predicate to skip pages."
+/>
+</div>
+
+**Figure 5**: Filter Pushdown in Parquet: query engines use the predicate,
+`C > 25`, from the query along with statistics from the metadata, to identify
+pages that may match the predicate which are read for further processing. 
+Please refer to the [Efficient Filter Pushdown] blog for more details.
+**NOTE the exact same pattern can be applied using information from external
+indexes, as described in the next sections.**
+
+
+[Efficient Filter Pushdown]: 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown
+
+## Pruning Files with External Indexes
+
+The first step in hierarchical pruning is quickly ruling out files that cannot
+match the query. For example, if a system expects to see queries that
+apply to a time range, it might create an external index to store the minimum
+and maximum `time` values for each file. Then, during query processing, the
+system can quickly rule out files that cannot possibly contain relevant data.
+
+For example, if the user issues a query that only matches the last 7 days of
+data:
+
+```sql
+WHERE time > now() - interval '7 days'
+```
+
+then the index can quickly rule out files that only have data older than the
+most recent 7 days.
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/prune-files.png"
+  width="80%"
+  class="img-responsive"
+  alt="Data Skipping: Pruning Files."
+/>
+</div>  
+
+**Figure 6**: Step 1: File Pruning. Given a query predicate, systems use 
external
+indexes to quickly rule out files that cannot match the query. In this case, by
+consulting the index all but two files can be ruled out.
+
+External indexes offer much faster lookups and lower I/O overhead than 
Parquet's
+built-in file-level indexes by skipping further processing for many data files.
+Without an external index, systems typically fall back to reading each file's
+footer to find files needed for further processing. Skipping per-file 
processing
+is especially important when reading from remote object stores such as [S3],
+[GCS] or [Azure Blob Store], where each request adds [tens to hundreds of
+milliseconds of latency].
+
+[S3]: https://aws.amazon.com/s3/
+[GCS]: https://cloud.google.com/storage
+[Azure Blob Store]: https://azure.microsoft.com/en-us/services/storage/blobs/
+[tens to hundreds of milliseconds of latency]: 
https://www.vldb.org/pvldb/vol16/p2769-durner.pdf
+
+There are many different systems that use external indexes to find files such 
as 
+[Hive Metadata 
Store](https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore),
+[Iceberg](https://iceberg.apache.org/), 
+[Delta Lake](https://delta.io/),
+[DuckLake](https://duckdb.org/2025/05/27/ducklake.html),
+and [Hive Style 
Partitioning](https://sparkbyexamples.com/apache-hive/hive-partitions-explained-with-examples/)<sup>[4](#footnote4)</sup>.
+Of course, each of these systems works well for their intended use cases, but
+if none meets your needs, or you want to experiment with
+different strategies, you can easily build your own external index using
+DataFusion.
+
+### Pruning Files with External Indexes Using DataFusion
+
+To implement file pruning in DataFusion, you implement a custom [TableProvider]
+with the [supports_filter_pushdown] and [scan] methods. The
+`supports_filter_pushdown` method tells DataFusion which predicates can be used
+and the `scan` method uses those predicates with the
+external index to find the files that may contain data that matches the query.
+
+[TableProvider]: 
https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html
+[supports_filter_pushdown]: 
https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html#method.supports_filters_pushdown
+[scan]: 
https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html#tymethod.scan
+
+The DataFusion repository contains a fully working and well-commented example,
+[parquet_index.rs], of this technique that you can use as a starting point. 
+The example creates a simple index that stores the min/max values for a column
+called `value` along with the file name. Then it runs the following query:
+
+```sql
+SELECT file_name, value FROM index_table WHERE value = 150
+```
+
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+
+The custom `IndexTableProvider`'s `scan` method uses the index to find files
+that may contain data matching the predicate as shown below:
+
+```rust
+impl TableProvider for IndexTableProvider {
+    async fn scan(
+        &self,
+        state: &dyn Session,
+        projection: Option<&Vec<usize>>,
+        filters: &[Expr],
+        limit: Option<usize>,
+    ) -> Result<Arc<dyn ExecutionPlan>> {
+        let df_schema = DFSchema::try_from(self.schema())?;
+        // Combine all the filters into a single ANDed predicate
+        let predicate = conjunction(filters.to_vec());
+
+        // Use the index to find the files that might have data that matches 
the
+        // predicate. Any file that can not have data that matches the 
predicate
+        // will not be returned.
+        let files = self.index.get_files(predicate.clone())?;
+
+        let object_store_url = ObjectStoreUrl::parse("file://")?;
+        let source = 
Arc::new(ParquetSource::default().with_predicate(predicate));
+        let mut file_scan_config_builder =
+            FileScanConfigBuilder::new(object_store_url, self.schema(), source)
+                .with_projection(projection.cloned())
+                .with_limit(limit);
+
+        // Add the files to the scan config
+        for file in files {
+            file_scan_config_builder = file_scan_config_builder.with_file(
+                PartitionedFile::new(file.path(), file_size.size()),
+            );
+        }
+        Ok(DataSourceExec::from_data_source(
+            file_scan_config_builder.build(),
+        ))
+    }
+    ...
+}
+```
+
+DataFusion handles the details of pushing down the filters to the
+`TableProvider` and the mechanics of reading the Parquet files, so you can 
focus
+on the system specific details such as building, storing, and applying the 
index.
+While this example uses a standard min/max index, you can implement any 
indexing
+strategy you need, such as bloom filters, a full text index, or a more complex
+multidimensional index.
+
+DataFusion also includes several libraries to help with common filtering and
+pruning tasks, such as:
+
+* A full and well documented expression representation ([Expr]) and [APIs for
+  building, visiting, and rewriting] query predicates.
+
+* Range Based Pruning ([PruningPredicate]) for cases where your index stores
+  min/max values.
+
+* Expression simplification ([ExprSimplifier]) for simplifying predicates 
before
+  applying them to the index.
+
+* Range analysis for predicates ([cp_solver]) for interval-based range analysis
+  (e.g. `col > 5 AND col < 10`).
+
+[Expr]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
+[APIs for building, visiting, and rewriting]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
+[PruningPredicate]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
+[ExprSimplifier]: 
https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/struct.ExprSimplifier.html#method.simplify
+[cp_solver]: 
https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html
+
+## Pruning Parts of Parquet Files with External Indexes
+
+Once the set of files to be scanned has been determined, the next step in the
+hierarchical pruning process is to further narrow down the data within each 
file.
+Similarly to the previous step, almost all advanced query processing systems 
use additional
+metadata to prune unnecessary parts of the file, such as [Data Skipping Indexes
+in ClickHouse]. 
+
+For Parquet-based systems, the most common strategy is using the built-in 
metadata such
+as [min/max statistics] and [Bloom Filters]. However, it is also possible to 
use external
+indexes for filtering *WITHIN* Parquet files as shown below. 
+
+[Data Skipping Indexes in ClickHouse]: 
https://clickhouse.com/docs/optimize/skipping-indexes
+[min/max statistics]: 
https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L267
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+
+<img
+  src="/blog/images/external-parquet-indexes/prune-row-groups.png"
+  width="80%"
+  class="img-responsive"
+  alt="Data Skipping: Pruning Row Groups and DataPages"
+/>
+
+**Figure 7**: Step 2: Pruning Parquet Row Groups and Data Pages. Given a query 
predicate,
+systems can use external indexes / metadata stores as well as Parquet's 
built-in
+structures to quickly rule out Row Groups and Data Pages that cannot match the 
query.
+In this case, the index has ruled out all but three data pages which must then 
be fetched
+for more processing.
+
+## Pruning Parts of Parquet Files with External Indexes using DataFusion
+
+To implement pruning within Parquet files, you use the same [`TableProvider`] 
APIs
+as for pruning files. For each file your provider wants to scan, you provide 
+an additional [ParquetAccessPlan] that tells DataFusion what parts of the file 
to read. This plan is
+then [further refined by the DataFusion Parquet reader] using the built-in
+Parquet metadata to potentially prune additional row groups and data pages
+during query execution. You can find a full working example in
+the [advanced_parquet_index.rs] example of the DataFusion repository.
+
+[ParquetAccessPlan]: 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
+[further refined by the DataFusion Parquet reader]: 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html#implementing-external-indexes
+
+Here is how you build a `ParquetAccessPlan` to scan only specific row groups
+and rows within those row groups. 
+
+```rust
+// Default to scan all (4) row groups
+let mut access_plan = ParquetAccessPlan::new_all(4);
+access_plan.skip(0); // skip row group 0
+// Specify scanning rows 100-200 and 350-400
+// in row group 1 that has 1000 rows
+let row_selection = RowSelection::from(vec![
+   RowSelector::skip(100),
+   RowSelector::select(100),
+   RowSelector::skip(150),
+   RowSelector::select(50),
+   RowSelector::skip(600),  // skip last 600 rows
+]);
+access_plan.scan_selection(1, row_selection);
+access_plan.skip(2); // skip row group 2
+// all of row group 3 is scanned by default
+```
+
+The rows that are selected by the resulting plan look like this:
+
+```text
+┌───────────────────┐
+│                   │
+│                   │  SKIP
+│                   │
+└───────────────────┘
+     Row Group 0
+
+┌───────────────────┐
+│ ┌───────────────┐ │  SCAN ONLY ROWS
+│ └───────────────┘ │  100-200
+│ ┌───────────────┐ │  350-400
+│ └───────────────┘ │
+└───────────────────┘
+     Row Group 1
+
+┌───────────────────┐
+│                   │
+│                   │  SKIP
+│                   │
+└───────────────────┘
+     Row Group 2
+
+┌───────────────────┐
+│                   │
+│                   │  SCAN ALL ROWS
+│                   │
+└───────────────────┘
+     Row Group 3
+```
+
+In the `scan` method, you return an `ExecutionPlan` that includes the
+`ParquetAccessPlan` for each file as shown below (again, slightly simplified 
for
+clarity):
+
+```rust
+impl TableProvider for IndexTableProvider {
+    async fn scan(
+        &self,
+        state: &dyn Session,
+        projection: Option<&Vec<usize>>,
+        filters: &[Expr],
+        limit: Option<usize>,
+    ) -> Result<Arc<dyn ExecutionPlan>> {
+        let indexed_file = &self.indexed_file;
+        let predicate = self.filters_to_predicate(state, filters)?;
+
+        // Use the external index to create a starting ParquetAccessPlan
+        // that determines which row groups to scan based on the predicate
+        let access_plan = self.create_plan(&predicate)?;
+
+        let partitioned_file = indexed_file
+            .partitioned_file()
+            // provide the access plan to the DataSourceExec by
+            // storing it as  "extensions" on PartitionedFile
+            .with_extensions(Arc::new(access_plan) as _);
+
+        let file_source = Arc::new(
+            ParquetSource::default()
+                // provide the predicate to the standard DataFusion source as 
well so
+                // DataFusion's Parquet reader will apply row group pruning 
based on
+                // the built-in Parquet metadata (min/max, bloom filters, etc) 
as well
+                .with_predicate(predicate)
+        );
+        let file_scan_config =
+            FileScanConfigBuilder::new(object_store_url, schema, file_source)
+                .with_limit(limit)
+                .with_projection(projection.cloned())
+                .with_file(partitioned_file)
+                .build();
+
+        // Finally, put it all together into a DataSourceExec
+        Ok(DataSourceExec::from_data_source(file_scan_config))
+    }
+    ...
+}
+```
+
+## Caching Parquet Metadata
+
+It is often said that Parquet is unsuitable for low latency query systems
+because the footer must be read and parsed for each query. This is simply not
+true, and **many systems use Parquet for low latency analytics and cache the 
parsed
+metadata in memory to avoid re-reading and re-parsing the footer for each 
query**.
+
+### Caching Parquet Metadata using DataFusion
+
+Reusing cached Parquet Metadata is also shown in the 
[advanced_parquet_index.rs]
+example. The example reads and caches the metadata for each file when the index
+is first built and then uses the cached metadata when reading the files during
+query execution.
+
+(Note that thanks to [Nuno Faria], [Jonathan Chen], and [Shehab Amin] the built
+in [ListingTable] `TableProvider` included with DataFusion will cache Parquet
+metadata in the next release of DataFusion (50.0.0). See the [mini epic] for
+details).
+
+[advanced_parquet_index.rs]:  
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+[ListingTable]: 
https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+[mini epic]: https://github.com/apache/datafusion/issues/17000
+[Nuno Faria]: https://nuno-faria.github.io/
+[Jonathan Chen]: https://github.com/jonathanc-n
+[Shehab Amin]: https://github.com/shehabgamin
+
+To avoid reparsing the metadata, first implement a custom
+[ParquetFileReaderFactory] as shown below, again slightly simplified for
+clarity:
+
+[ParquetFileReaderFactory]: 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/trait.ParquetFileReaderFactory.html
+
+
+```rust
+impl ParquetFileReaderFactory for CachedParquetFileReaderFactory {
+    fn create_reader(
+        &self,
+        _partition_index: usize,
+        file_meta: FileMeta,
+        metadata_size_hint: Option<usize>,
+        _metrics: &ExecutionPlanMetricsSet,
+    ) -> Result<Box<dyn AsyncFileReader + Send>> {
+        let filename = file_meta.location();
+        
+        // Pass along the information to access the underlying storage
+        // (e.g. S3, GCS, local filesystem, etc)
+        let object_store = Arc::clone(&self.object_store);
+        let mut inner =
+            ParquetObjectReader::new(object_store, 
file_meta.object_meta.location)
+                .with_file_size(file_meta.object_meta.size);
+      
+        // retrieve the pre-parsed metadata from the cache
+        // (which was built when the index was built and is kept in memory)
+        let metadata = self
+            .metadata
+            .get(&filename)
+            .expect("metadata for file not found: {filename}");
+      
+        // Return a ParquetReader that uses the cached metadata
+        Ok(Box::new(ParquetReaderWithCache {
+            filename,
+            metadata: Arc::clone(metadata),
+            inner,
+        }))
+    }
+}
+```
+
+Then, in your [TableProvider] use the factory to avoid re-reading the metadata
+for each file:
+
+```rust
+impl TableProvider for IndexTableProvider {
+    async fn scan(
+        &self,
+        state: &dyn Session,
+        projection: Option<&Vec<usize>>,
+        filters: &[Expr],
+        limit: Option<usize>,
+    ) -> Result<Arc<dyn ExecutionPlan>> {
+        // Configure a factory interface to avoid re-reading the metadata for 
each file
+        let reader_factory =
+            CachedParquetFileReaderFactory::new(Arc::clone(&self.object_store))
+                .with_file(indexed_file);
+
+        // build the partitioned file (see example above for details)
+        let partitioned_file = ...; 
+      
+        // Create the ParquetSource with the predicate and the factory
+        let file_source = Arc::new(
+            ParquetSource::default()
+                // provide the factory to create Parquet reader without 
re-reading metadata
+                .with_parquet_file_reader_factory(Arc::new(reader_factory)),
+        );
+      
+        // Pass along the information needed to read the files
+        let file_scan_config =
+            FileScanConfigBuilder::new(object_store_url, schema, file_source)
+                .with_limit(limit)
+                .with_projection(projection.cloned())
+                .with_file(partitioned_file)
+                .build();
+
+        // Finally, put it all together into a DataSourceExec
+        Ok(DataSourceExec::from_data_source(file_scan_config))
+    }
+    ...
+}
+```
+
+## Conclusion
+
+Parquet has the right structure for high performance analytics via hierarchical
+pruning, and it is straightforward to build external indexes to speed up 
queries
+using DataFusion without changing the file format. If you need to build a 
custom
+data platform, it has never been easier to build it with Parquet and 
DataFusion.
+
+I am a firm believer that data systems of the future will be built on a
+foundation of modular, high quality, open source components such as Parquet,
+Arrow, and DataFusion. We should focus our efforts as a community on
+improving these components rather than building new file formats that are
+optimized for narrow use cases.
+
+Come Join Us! 🎣 
+
+<a href="https://datafusion.apache.org/contributor-guide/communication.html";>
+<img
+  src="/blog/images/logo_original4x.png"
+  width="20%"
+  class="img-responsive"
+  alt="https://datafusion.apache.org/";
+/>
+</a>
+
+
+## About the Author
+
+[Andrew Lamb](https://www.linkedin.com/in/andrewalamb/) is a Staff Engineer at
+[InfluxData](https://www.influxdata.com/), and a member of the [Apache
+DataFusion](https://datafusion.apache.org/) and [Apache 
Arrow](https://arrow.apache.org/) PMCs. He has been working on
+Databases and related systems more than 20 years.
+
+## About DataFusion
+
+[Apache DataFusion] is an extensible query engine toolkit, written
+in Rust, that uses [Apache Arrow] as its in-memory format. DataFusion and
+similar technology are part of the next generation “Deconstructed Database”
+architectures, where new systems are built on a foundation of fast, modular
+components, rather than as a single tightly integrated system.
+
+The [DataFusion community] is always looking for new contributors to help
+improve the project. If you are interested in learning more about how query
+execution works, help document or improve the DataFusion codebase, or just try
+it out, we would love for you to join us.
+
+[Apache Arrow]: https://arrow.apache.org/
+[Apache DataFusion]: https://datafusion.apache.org/
+[DataFusion community]: 
https://datafusion.apache.org/contributor-guide/communication.html
+
+### Acknowledgements
+
+Thank you to [Qi Zhu], [Adam Reeve], [Jigao Luo], [Oleks V], [Shehab Amin], 
[Nuno Faria]
+and [Bruce Ritchie] for their insightful feedback on this blog post.
+
+
+
+
+### Footnotes
+
+<a id="footnote1"></a>`1`: This trend is described in more detail in the [FDAP 
Stack] blog
+
+<a id="footnote2"></a>`2`: This layout is referred to as [PAX in the
+database literature] after the first research paper to describe the technique.
+
+[PAX in the database literature]: https://www.vldb.org/conf/2001/P169.pdf
+
+<a id="footnote3"></a>`3`: Benchmaxxing (verb): to add specific optimizations 
that only
+impact benchmark results and are not widely applicable to real world use cases.
+
+<a id="footnote4"></a>`4`: Hive Style Partitioning is a simple and widely used 
form of indexing based on directory paths, where the directory structure is 
used to
+store information about the data in the files. For example, a directory 
structure like `year=2025/month=08/day=15/` can be used to store data for a 
specific day
+and the system can quickly rule out directories that do not match the query 
predicate.
+
+<a id="footnote5"></a>`5`: I am also convinced that we can speed up the 
process of parsing Parquet footer
+with additional engineering effort (see [Xiangpeng Hao]'s [previous blog on the
+topic]). [Ed Seidl] is beginning this effort. See the [ticket] for details.
+
+<a id="footnote6"></a>`6`: ClickBench includes a wide variety of query patterns
+such as point lookups, filters of different selectivity, and aggregations.
+
+<a id="footnote7"></a>`7`: For example, [Qi Zhu] was able to speed up reads by 
over 2x 
+simply by rewriting the Parquet files with Offset Indexes and no compression 
(see [issue #16149 comment] for details).
+There is likely significant additional performance available by using Bloom 
Filters and resorting the data
+to be clustered in a more optimal way for the queries.
+
+[Qi Zhu]: https://github.com/zhuqi-lucas
+[Adam Reeve]: https://github.com/adamreeve
+[Oleks V]: https://github.com/comphead
+[Jigao Luo]: https://github.com/JigaoLuo
+[Bruce Ritchie]: https://github.com/Omega359
+
+[issue #16149 comment]: 
https://github.com/apache/datafusion/issues/16149#issuecomment-2918761743
+
+[Xiangpeng Hao]: https://xiangpeng.systems/
+[previous blog on the topic]: 
https://www.influxdata.com/blog/how-good-parquet-wide-tables/
+[Ed Seidl]: https://github.com/etseidl
+[ticket]: https://github.com/apache/arrow-rs/issues/5854
diff --git 
a/content/images/external-parquet-indexes/external-index-overview.png 
b/content/images/external-parquet-indexes/external-index-overview.png
new file mode 100644
index 0000000..8906c8a
Binary files /dev/null and 
b/content/images/external-parquet-indexes/external-index-overview.png differ
diff --git 
a/content/images/external-parquet-indexes/parquet-filter-pushdown.png 
b/content/images/external-parquet-indexes/parquet-filter-pushdown.png
new file mode 100644
index 0000000..de408bf
Binary files /dev/null and 
b/content/images/external-parquet-indexes/parquet-filter-pushdown.png differ
diff --git a/content/images/external-parquet-indexes/parquet-layout.png 
b/content/images/external-parquet-indexes/parquet-layout.png
new file mode 100644
index 0000000..cf23b62
Binary files /dev/null and 
b/content/images/external-parquet-indexes/parquet-layout.png differ
diff --git a/content/images/external-parquet-indexes/parquet-metadata.png 
b/content/images/external-parquet-indexes/parquet-metadata.png
new file mode 100644
index 0000000..12c81d1
Binary files /dev/null and 
b/content/images/external-parquet-indexes/parquet-metadata.png differ
diff --git a/content/images/external-parquet-indexes/processing-pipeline.png 
b/content/images/external-parquet-indexes/processing-pipeline.png
new file mode 100644
index 0000000..64e91b9
Binary files /dev/null and 
b/content/images/external-parquet-indexes/processing-pipeline.png differ
diff --git a/content/images/external-parquet-indexes/prune-files.png 
b/content/images/external-parquet-indexes/prune-files.png
new file mode 100644
index 0000000..ba0da7e
Binary files /dev/null and 
b/content/images/external-parquet-indexes/prune-files.png differ
diff --git a/content/images/external-parquet-indexes/prune-row-groups.png 
b/content/images/external-parquet-indexes/prune-row-groups.png
new file mode 100644
index 0000000..4f5d48e
Binary files /dev/null and 
b/content/images/external-parquet-indexes/prune-row-groups.png differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datafusion-site) branch main updated: [BLOG] Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet (#99)

Reply via email to