This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch site/external_indexes
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit f1a55c3d97fbd3dfe69effaab585379daee492a3
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Fri Aug 8 11:21:57 2025 -0400

    copilot suggestions
---
 .../blog/2025-08-15-external-parquet-indexes.md | 27 +++++++++++-----------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 238b9d3..bcd5d09 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -26,9 +26,10 @@ limitations under the License.
 
 It is a common misconception that [Apache Parquet] requires (slow) reparsing of
-metadata and is limited to indexing structures provided by the format. By
-caching parsed metadata and using custom external indexes along with the
-Parquet's hierarchical data organization to significantly speed up query processing.
+metadata and is limited to indexing structures provided by the format. In fact,
+caching parsed metadata and using custom external indexes along with
+Parquet's hierarchical data organization can significantly speed up query
+processing.
 
 In this blog, I describe the role of external indexes, caches, and metadata
 stores in high performance systems, and demonstrate how to apply these concepts
@@ -104,7 +105,7 @@ strategies to keep indexes up to date, and ways to apply indexes during query
 processing. These differences each have their own set of tradeoffs, and thus
 different systems understandably make different choices depending on their use
 case. There is no one-size-fits-all solution for indexing. For example, Hive
-uses the [Hive Metastore], [Vertica] uses a purpose built-in [Catalog] and open
+uses the [Hive Metastore], [Vertica] uses a purpose-built [Catalog] and open
 data lake systems typically use a table format like [Apache Iceberg] or [Delta
 Lake].
 
@@ -115,7 +116,7 @@ add a new Parquet file to your data lake you must also update the relevant
 external index to include information about the new file. Note, it **is**
 possible to avoid external indexes by only using information from the data
 files themselves, such as embed user-defined indexes directly in Parquet files,
-describe our previous blog [Embedding User-Defined Indexes in Apache Parquet
+described in our previous blog [Embedding User-Defined Indexes in Apache Parquet
 Files].
 
 Examples of information commonly stored in external indexes include:
@@ -157,7 +158,7 @@ limitations of a particular [implementation of the Parquet format] with the
 
 Apache Parquet's combination of good compression, high-performance, high quality
 open source libraries, and wide ecosystem interoperability make it a compelling
-choice when building new systems. While there are some niche use case that may
+choice when building new systems. While there are some niche use cases that may
 benefit from specialized formats, Parquet is typically the obvious choice.
 While recent proprietary file formats differ in details, they all use the same
 high level structure<sup>[2](#footnote2)</sup>:
@@ -169,7 +170,7 @@ The structure is so widespread because it enables the hierarchical pruning
 approach described in the next section. For example, the native [Clickhouse
 MergeTree] format consists of *Parts* (similar to Parquet files), and
 *Granules* (similar to Row Groups).
 The [Clickhouse indexing strategy] follows a classic
-heirarchal pruning approach that first locates the Parts and then the Granules
+hierarchical pruning approach that first locates the Parts and then the Granules
 that may contain relevant data for the query. This is exactly the same pattern
 as Parquet based systems, which first locate the relevant Parquet files and
 then the Row Groups / Data Pages within those files.
@@ -304,7 +305,7 @@ indexes, as described in the next sections.**
 
 # Pruning Files with External Indexes
 
-The first step in heirarchal pruning is quickly ruling out files that cannot
+The first step in hierarchical pruning is quickly ruling out files that cannot
 match the query. For example, if a system expects to have see queries that
 apply to a time range, it might create an external index to store the minimum
 and maximum `time` values for each file. Then, during query processing, the
@@ -420,7 +421,7 @@ DataFusion also includes several libraries to help with common filtering and
 pruning tasks, such as:
 
 * A full and well documented expression representation ([Expr]) and [APIs for
-  building, vistiting, and rewriting] query predicates
+  building, visiting, and rewriting] query predicates
 
 * Range Based Pruning ([PruningPredicate]) for cases where your index stores
   min/max values.
@@ -429,7 +430,7 @@ pruning tasks, such as:
 * Range analysis for predicates ([cp_solver]) for interval-based range analysis
   (e.g. `col > 5 AND col < 10`)
 
 [Expr]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
-[APIs for building, vistiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
+[APIs for building, visiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
 [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
 [ExprSimplifier]: https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/struct.ExprSimplifier.html#method.simplify
 [cp_solver]: https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html
@@ -437,14 +438,14 @@ pruning tasks, such as:
 
 # Pruning Parts of Parquet Files with External Indexes
 
 Once the set of files to be scanned has been determined, the next step in the
-heirarchal pruning process is to further narrow down the data within each file.
+hierarchical pruning process is to further narrow down the data within each file.
 Similarly to the previous step, almost all advanced query processing systems
 use additional metadata to prune unnecessary parts of the file, such as [Data
 Skipping Indexes in ClickHouse]. For Parquet-based systems, the most common
 strategy is using the built-in metadata such as [min/max statistics], and
 [Bloom Filters]). However, it is also possible to use external
-indexes for filtering *WITIHIN* Parquet files as shown below.
+indexes for filtering *WITHIN* Parquet files as shown below.
 
 [Data Skipping Indexes in ClickHouse]: https://clickhouse.com/docs/optimize/skipping-indexes
 [min/max statistics]: https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L267
@@ -753,7 +754,7 @@ and the system can quickly rule out directories that do not match the query pred
 
 <a id="footnote5"></a>`5`: I am also convinced that we can speed up the
 process of parsing Parquet footer with additional engineering effort (see
 [Xiangpeng Hao]'s [previous blog on the
-topic]). [Ed Seidl] is begining this effort. See the [ticket] for details.
+topic]). [Ed Seidl] is beginning this effort. See the [ticket] for details.
 
 <a id="footnote6"></a>`6`: ClickBench includes a wide variety of query patterns
 such as point lookups, filters of different selectivity, and aggregations.
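As background for the `PruningPredicate` references in the patch above, here is a minimal sketch of the range-based file pruning the post describes: an external index supplies per-file min/max `time` values, and DataFusion's `PruningPredicate` rules out files whose range cannot satisfy the query predicate. The `FileTimeIndex` type and the sample ranges are illustrative assumptions, not code from the post or this commit, and the exact `PruningStatistics` trait methods vary between DataFusion versions.

```rust
use std::collections::HashSet;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, BooleanArray, Int64Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::{Column, DFSchema, ScalarValue};
use datafusion::error::Result;
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};
use datafusion::prelude::{col, lit, SessionContext};

/// Hypothetical external index holding min/max `time` values for each file.
struct FileTimeIndex {
    mins: Int64Array,
    maxes: Int64Array,
}

impl PruningStatistics for FileTimeIndex {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "time").then(|| Arc::new(self.mins.clone()) as ArrayRef)
    }
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "time").then(|| Arc::new(self.maxes.clone()) as ArrayRef)
    }
    /// One pruning "container" per file in this sketch.
    fn num_containers(&self) -> usize {
        self.mins.len()
    }
    // This index stores no null counts, row counts, or exact-value sets.
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn contained(&self, _column: &Column, _values: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        None
    }
}

fn main() -> Result<()> {
    // Table schema: a single Int64 `time` column.
    let schema = Arc::new(Schema::new(vec![Field::new("time", DataType::Int64, false)]));

    // Query predicate `time > 100`, planned into a physical expression.
    let df_schema = DFSchema::try_from(schema.as_ref().clone())?;
    let predicate =
        SessionContext::new().create_physical_expr(col("time").gt(lit(100_i64)), &df_schema)?;

    // Three files whose `time` ranges are [0, 50], [40, 200], and [150, 300].
    let index = FileTimeIndex {
        mins: Int64Array::from(vec![0, 40, 150]),
        maxes: Int64Array::from(vec![50, 200, 300]),
    };

    // `false` means a file's range cannot match the predicate, so it is skipped.
    let keep = PruningPredicate::try_new(predicate, schema)?.prune(&index)?;
    assert_eq!(keep, vec![false, true, true]);
    Ok(())
}
```

The same pattern applies one level down the hierarchy: pointing the `PruningStatistics` containers at Row Groups or Data Pages instead of whole files yields the within-file pruning covered by the second patched section.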