This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch site/external_indexes in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit e37ec5c8d95e6c06f142bad7bcd56b47cf8d5840
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Fri Aug 8 10:11:41 2025 -0400

    hone
---
 .../blog/2025-08-15-external-parquet-indexes.md | 37 +++++++++++-----------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 3299a52..9ccb242 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -356,19 +356,19 @@ external index to find the files that may contain data that matches the query.
 [supports_filter_pushdown]: https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html#method.supports_filters_pushdown
 [scan]: https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html#tymethod.scan
 
-The DataFusion repository contains a fully working and well commented
-[parquet_index.rs] example of using an external index to prune files based on a
-query predicate. The example demonstrates query that includes the predicate
-`value = 150`, and how the `IndexTableProvider` uses the index to determine
-that only two files are needed.
-
-[parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+The DataFusion repository contains a fully working and well-commented example,
+[parquet_index.rs], of this technique that you can use as a starting point.
+The example creates a simple index that stores the min/max values for a column
+called `value` along with the file name. Then it runs the following query:
 
 ```sql
 SELECT file_name, value FROM index_table WHERE value = 150
 ```
 
-The code from the example is as follows (slightly simplified for clarity):
+[parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+
+The custom `IndexTableProvider`'s `scan` method uses the index to find files
+that may contain data matching the predicate as shown below:
 
 ```rust
 impl TableProvider for IndexTableProvider {
@@ -409,25 +409,24 @@ impl TableProvider for IndexTableProvider {
 }
 ```
 
-While this example uses a standard min/max index, you can implement any indexing
-strategy you need, such as a bloom filters, a full text index, or a more
-complex multi-dimensional index.
-
 DataFusion handles the details of pushing down the filters to the
-`TableProvider` and the mechanics of reading the parquet files, and you focus on
-the system specific details such as building, storing and applying the index.
+`TableProvider` and the mechanics of reading the parquet files, so you can focus
+on the system specific details such as building, storing and applying the index.
+While this example uses a standard min/max index, you can implement any indexing
+strategy you need, such as a bloom filters, a full text index, or a more complex
+multi-dimensional index.
 
-DataFusion also includes several libraries code to help you with common
-filtering tasks, such as:
+DataFusion also includes several libraries to help with common filtering and
+pruning tasks, such as:
 
 * A full and well documented expression representation ([Expr]) and [APIs for building, vistiting, and rewriting] query predicates
 
-* Range Based Pruning ([PruningPredicate]) for cases where your index stores min/max values for some/all columns.
+* Range Based Pruning ([PruningPredicate]) for cases where your index stores min/max values.
 
-* Expression simplification ([ExprSimplifier] for simplifying predicates before applying them to the index.
+* Expression simplification ([ExprSimplifier]) for simplifying predicates before applying them to the index.
 
-* Range analysis for predicates [cp_solver] for interval based range analysis (e.g. `col > 5 AND col < 10`)
+* Range analysis for predicates ([cp_solver]) for interval based range analysis (e.g. `col > 5 AND col < 10`)
 
 [Expr]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
 [APIs for building, vistiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
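
Note on the technique the changed text describes: the min/max file pruning behind a predicate like `value = 150` can be sketched without DataFusion at all. The following is a minimal standalone sketch, not the code from `parquet_index.rs`. The `FileStats` and `MinMaxIndex` types, the `files_for_eq` method, the `example_index` helper, and the file names and ranges are all hypothetical and exist only for illustration.

```rust
/// Hypothetical index entry: the min/max of the `value` column for one file.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    file_name: String,
    min_value: i32,
    max_value: i32,
}

/// Hypothetical in-memory external index over a set of files.
struct MinMaxIndex {
    files: Vec<FileStats>,
}

impl MinMaxIndex {
    /// Return the names of files whose [min, max] range could contain `target`.
    /// Files outside the range are skipped; files inside it may still have no
    /// matching row, so the result is a superset of the files that match.
    fn files_for_eq(&self, target: i32) -> Vec<&str> {
        self.files
            .iter()
            .filter(|f| f.min_value <= target && target <= f.max_value)
            .map(|f| f.file_name.as_str())
            .collect()
    }
}

/// Build a small index with made-up file names and value ranges.
fn example_index() -> MinMaxIndex {
    let entry = |name: &str, min_value: i32, max_value: i32| FileStats {
        file_name: name.to_string(),
        min_value,
        max_value,
    };
    MinMaxIndex {
        files: vec![
            entry("file1.parquet", 0, 100),
            entry("file2.parquet", 100, 200),
            entry("file3.parquet", 150, 300),
        ],
    }
}

fn main() {
    let index = example_index();
    // Equivalent in spirit to `WHERE value = 150`: only files whose range
    // covers 150 remain candidates for scanning.
    let candidates = index.files_for_eq(150);
    println!("files to scan: {:?}", candidates);
}
```

In this sketch the index can only rule files out. A real `scan` implementation would still read the surviving files and apply the predicate to their rows.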