This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch site/external_indexes
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit 14b17f6d494ddae1fceca19d126ddda5c0e3af90
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Fri Aug 8 10:25:11 2025 -0400

    updates
---
 .../blog/2025-08-15-external-parquet-indexes.md | 82 +++++++++++----------
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 9ccb242..d346285 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -30,7 +30,7 @@ metadata and is limited to indexing structures provided by the format.
 By caching parsed metadata and using custom external indexes along with the
 Parquet's hierarchical data organization to significantly speed up query
 processing.
 
-In this blog, we describe the role of external indexes, caches, and metadata
+In this blog, I describe the role of external indexes, caches, and metadata
 stores in high performance systems, and demonstrate how to apply these concepts
 to Parquet processing using [Apache DataFusion].
 *Note this is an expanded version of the [companion video] and [presentation].*
@@ -90,7 +90,7 @@ to further narrow the required data to only those **parts** of each file (e.g.
 data pages) that are relevant (Step 3). Finally, the system reads only those
 parts of the file and returns the results to the user (Step 4).
 
-In this blog, we use the term **"index"** to mean any structure that helps
+In this blog, I use the term **"index"** to mean any structure that helps
 locate relevant data during processing, and a high level overview of how
 external indexes are used to speed up queries is shown in Figure 1.
@@ -147,7 +147,7 @@ Examples of locations external indexes can be stored include:
 # Using Apache Parquet for Storage
 
 While the rest of this blog focuses on building custom external indexes using
-Parquet and DataFusion, we first briefly discuss why Parquet is a good choice
+Parquet and DataFusion, I first briefly discuss why Parquet is a good choice
 for modern analytic systems. The research community frequently confuses
 limitations of a particular [implementation of the Parquet format] with the
 [Parquet Format] itself and it is important to clarify this distinction.
@@ -426,7 +426,7 @@ pruning tasks, such as:
 * Expression simplification ([ExprSimplifier]) for simplifying predicates
   before applying them to the index.
-* Range analysis for predicates ([cp_solver]) for interval based range analysis (e.g. `col > 5 AND col < 10`)
+* Range analysis for predicates ([cp_solver]) for interval-based range analysis (e.g. `col > 5 AND col < 10`)
 
 [Expr]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
 [APIs for building, vistiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
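To make the pruning utilities above concrete, here is a minimal sketch of how an external file-level index can drive DataFusion's `PruningPredicate`, which evaluates a predicate against min/max statistics and reports which "containers" (files, in this case) can possibly match. The `MinMaxIndex` type is hypothetical, the index covers a single `Int64` column named `value`, and the import paths assume a recent DataFusion version:

```rust
use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, Int64Array};
use arrow::datatypes::SchemaRef;
use datafusion::common::{Column, Result, ScalarValue};
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};
use datafusion::physical_plan::PhysicalExpr;

/// Hypothetical external index: min/max values of the `value` column,
/// with one entry per file.
struct MinMaxIndex {
    mins: Vec<i64>,
    maxes: Vec<i64>,
}

impl PruningStatistics for MinMaxIndex {
    /// One "container" per indexed file
    fn num_containers(&self) -> usize {
        self.mins.len()
    }
    /// Minimum value of the requested column in each file
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "value")
            .then(|| Arc::new(Int64Array::from(self.mins.clone())) as ArrayRef)
    }
    /// Maximum value of the requested column in each file
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "value")
            .then(|| Arc::new(Int64Array::from(self.maxes.clone())) as ArrayRef)
    }
    // Returning None means "unknown", which never prunes incorrectly
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn contained(&self, _c: &Column, _v: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        None
    }
}

/// Given the physical form of a predicate such as `value > 5` and the file
/// schema, return one boolean per file: `false` files cannot contain any
/// matching rows and can be skipped entirely.
fn prune_files(
    predicate: Arc<dyn PhysicalExpr>,
    schema: SchemaRef,
    index: &MinMaxIndex,
) -> Result<Vec<bool>> {
    PruningPredicate::try_new(predicate, schema)?.prune(index)
}
```

Because the statistics are supplied through a trait, the same machinery works whether they come from Parquet footers, an external file, or a remote metadata store.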
@@ -436,15 +436,15 @@ pruning tasks, such as:
 
 # Pruning Parts of Parquet Files with External Indexes
 
-Once the set of files to be scanned has been determined, the next step is
-determine which parts of each file can match the query. Similarly to the
-previous step, almost all advanced query processing systems use additional
+Once the set of files to be scanned has been determined, the next step in the
+hierarchical pruning process is to further narrow down the data within each file.
+Similarly to the previous step, almost all advanced query processing systems use additional
 metadata to prune unnecessary parts of the file, such as [Data Skipping
 Indexes in ClickHouse].
 
-For Parquet based systems, the most common strategy is to use the built-in metadata such
+For Parquet-based systems, the most common strategy is using the built-in metadata such
 as [min/max statistics], and [Bloom Filters]). However, it is also possible to use external
-indexes even for filtering *WITIHIN* Parquet files as shown below.
+indexes for filtering *WITHIN* Parquet files as shown below.
 
 [Data Skipping Indexes in ClickHouse]: https://clickhouse.com/docs/optimize/skipping-indexes
 [min/max statistics]: https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L267
@@ -458,30 +458,31 @@ indexes even for filtering *WITIHIN* Parquet files as shown below.
 />
 
 **Figure 7**: Step 2: Pruning Parquet Row Groups and Data Pages. Given a query predicate,
-systems can use external indexes / metadata stores along with Parquet's built-in
-structures to quickly rule out row groups and data pages that cannot match the query.
+systems can use external indexes / metadata stores as well as Parquet's built-in
+structures to quickly rule out Row Groups and Data Pages that cannot match the query.
 In this case, the index has ruled out all but three data pages which must then be
 fetched for more processing.
 
 # Pruning Parts of Parquet Files with External Indexes using DataFusion
 
-To implement pruning within Parquet files, you provide a [ParquetAccessPlan] for
-each file that tells DataFusion what parts of the file to read. This plan is
+To implement pruning within Parquet files, you use the same [`TableProvider`] APIs
+as for pruning files. For each file your provider wants to scan, you provide
+an additional [ParquetAccessPlan] that tells DataFusion what parts of the file to read. This plan is
 then [further refined by the DataFusion Parquet reader] using the built-in
 Parquet metadata to potentially prune additional row groups and data pages
-during query execution. You can find a full working example of this technique in
-the [advanced_parquet_index.rs] example.
+during query execution. You can find a full working example in
+the [advanced_parquet_index.rs] example in the DataFusion repository.
 
 [ParquetAccessPlan]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
 [further refined by the DataFusion Parquet reader]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html#implementing-external-indexes
 
 Here is how you can build a `ParquetAccessPlan` to scan only specific row groups
 and rows within those row groups.
 
 ```rust
-// Default to scan all row groups
+// Default to scan all (4) row groups
 let mut access_plan = ParquetAccessPlan::new_all(4);
-access_plan.skip(0); // skip row group
+access_plan.skip(0); // skip row group 0
 // Specify scanning rows 100-200 and 350-400
 // in row group 1 that has 1000 rows
 let row_selection = RowSelection::from(vec![
@@ -496,7 +497,7 @@ access_plan.skip(2); // skip row group 2
 // all of row group 3 is scanned by default
 ```
 
-The resulting plan looks like this:
+The rows that are selected by the resulting plan look like this:
 
 ```text
 ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
@@ -526,10 +527,9 @@
 Row Group 2           Row Group 3
 ```
 
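For completeness, here is a sketch of the full plan construction, assuming row group 1 holds 1000 rows. The `scan_selection` call, which attaches a `RowSelection` to a specific row group, follows the API shown in the [advanced_parquet_index.rs] example; the exact selector values are illustrative:

```rust
use datafusion::datasource::physical_plan::parquet::ParquetAccessPlan;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

// Default to scanning all 4 row groups
let mut access_plan = ParquetAccessPlan::new_all(4);
access_plan.skip(0); // skip row group 0

// Select rows 100-200 and 350-400 within row group 1 (1000 rows total)
let row_selection = RowSelection::from(vec![
    RowSelector::skip(100),   // rows 0..100
    RowSelector::select(100), // rows 100..200
    RowSelector::skip(150),   // rows 200..350
    RowSelector::select(50),  // rows 350..400
    RowSelector::skip(600),   // rows 400..1000
]);
access_plan.scan_selection(1, row_selection);

access_plan.skip(2); // skip row group 2
// all of row group 3 is scanned by default
```

Note that the skip/select lengths within a row group must add up to its total row count, which is why the trailing 600 rows are skipped explicitly.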
-You provide the `ParquetAccessPlan` via a `TableProvider`, similarly to the
-previous section. In the `scan` method, you can return an `ExecutionPlan` that
-includes the `ParquetAccessPlan` for each file as follows (again, slightly
-simplified for clarity):
+In the `scan` method, you return an `ExecutionPlan` that includes the
+`ParquetAccessPlan` for each file as shown below (again, slightly simplified for
+clarity):
 
 ```rust
 impl TableProvider for IndexTableProvider {
@@ -581,10 +581,9 @@ impl TableProvider for IndexTableProvider {
 
 # Caching Parquet Metadata
 
 It is often said that Parquet is unsuitable for low latency query systems
-because the footer must be read and parsed for each query, which is simply not
-true. Low latency analytic systems are always stateful in practice, with multiple caching layers.
-These systems optimize for low latency by caching the parsed metadata in memory, so there
-is no need to re-read and re-parse the footer for each query.
+because the footer must be read and parsed for each query. This is simply not
+true, and **many systems use Parquet for low latency analytics and cache the parsed
+metadata in memory to avoid re-reading and re-parsing the footer for each query**.
 
 ## Caching Parquet Metadata using DataFusion
 
@@ -593,14 +592,17 @@
 example. The example reads and caches the metadata for each file when the index
 is first built and then uses the cached metadata when reading the files during
 query execution.
 
-( Note that thanks to [Nuno Faria], the built in [ListingTable] will cache
-parsed metadata in the next release of DataFusion (50.0.0). See the [mini epic]
-for details).
+(Note that thanks to [Nuno Faria], [Jonathan Chen], and [Shehab Amin], the built
+in [ListingTable] `TableProvider` included with DataFusion will cache Parquet
+metadata in the next release of DataFusion (50.0.0). See the [mini epic] for
+details.)
 
 [advanced_parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
 [ListingTable]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 [mini epic]: https://github.com/apache/datafusion/issues/17000
 [Nuno Faria]: https://nuno-faria.github.io/
+[Jonathan Chen]: https://github.com/jonathanc-n
+[Shehab Amin]: https://github.com/shehabgamin
 
 To avoid reparsing the metadata, first implement a custom
 [ParquetFileReaderFactory] as shown below, again slightly simplified for
@@ -644,7 +646,7 @@ impl ParquetFileReaderFactory for CachedParquetFileReaderFactory {
 }
 ```
 
-Then, in your [`TableProvider`] use the factory to avoid re-reading the metadata
+Then, in your [TableProvider], use the factory to avoid re-reading the metadata
 for each file:
 
 ```rust
 impl TableProvider for IndexTableProvider {
@@ -686,18 +688,18 @@ impl TableProvider for IndexTableProvider {
 }
 ```
 
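Conceptually, the factory consults a shared cache of parsed footers that is filled when the index is built and read at query time. Here is a minimal sketch of such a cache; the `FooterCache` type and its methods are hypothetical, not part of the DataFusion API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use parquet::file::metadata::ParquetMetaData;

/// Hypothetical in-memory cache of parsed Parquet footers, keyed by file
/// path. A production version might bound its size or evict cold entries.
#[derive(Default)]
struct FooterCache {
    entries: Mutex<HashMap<String, Arc<ParquetMetaData>>>,
}

impl FooterCache {
    /// Store a parsed footer when the external index is first built
    fn insert(&self, path: &str, metadata: Arc<ParquetMetaData>) {
        self.entries.lock().unwrap().insert(path.to_string(), metadata);
    }

    /// Look up a parsed footer at query time; the custom
    /// `ParquetFileReaderFactory` returns this instead of re-reading and
    /// re-parsing the footer bytes from object storage
    fn get(&self, path: &str) -> Option<Arc<ParquetMetaData>> {
        self.entries.lock().unwrap().get(path).cloned()
    }
}
```

Sharing the metadata via `Arc` keeps lookups cheap: each query clones a pointer rather than re-decoding a footer that can run to hundreds of kilobytes for wide tables.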
-
 # Conclusion
 
-Parquet has the right structure for high performance analytics and it is
-straightforward to build external indexes to speed up queries using DataFusion
-without changing the file format.
+Parquet has the right structure for high performance analytics via hierarchical
+pruning, and it is straightforward to build external indexes to speed up queries
+using DataFusion without changing the file format. If you need to build a custom
+data platform, it has never been easier to build it with Parquet and DataFusion.
 
-I am a firm believer that data systems of the future will built on a foundation
-of modular, high quality, open source components such as Parquet, Arrow and
-DataFusion. and we should focus our efforts as a community on improving these
-components rather than building new file formats that are optimized for
-narrow use cases.
+I am a firm believer that data systems of the future will be built on a
+foundation of modular, high quality, open source components such as Parquet,
+Arrow and DataFusion, and we should focus our efforts as a community on
+improving these components rather than building new file formats that are
+optimized for narrow use cases.
 
 Come Join Us! 🎣