This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch site/external_indexes
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit f1a55c3d97fbd3dfe69effaab585379daee492a3
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Fri Aug 8 11:21:57 2025 -0400

    copilot suggestions
---
 .../blog/2025-08-15-external-parquet-indexes.md | 27 +++++++++++-----------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 238b9d3..bcd5d09 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -26,9 +26,10 @@ limitations under the License.
 
 It is a common misconception that [Apache Parquet] requires (slow) reparsing of
-metadata and is limited to indexing structures provided by the format. By
-caching parsed metadata and using custom external indexes along with the
-Parquet's hierarchical data organization to significantly speed up query processing.
+metadata and is limited to indexing structures provided by the format. In fact,
+caching parsed metadata and using custom external indexes along with
+Parquet's hierarchical data organization can significantly speed up query
+processing.
 
 In this blog, I describe the role of external indexes, caches, and metadata
 stores in high performance systems, and demonstrate how to apply these concepts
@@ -104,7 +105,7 @@ strategies to keep indexes up to date, and ways to apply indexes during query
 processing. These differences each have their own set of tradeoffs, and thus
 different systems understandably make different choices depending on their use
 case. There is no one-size-fits-all solution for indexing. For example, Hive
-uses the [Hive Metastore], [Vertica] uses a purpose built-in [Catalog] and open
+uses the [Hive Metastore], [Vertica] uses a purpose-built [Catalog] and open
 data lake systems typically use a table format like [Apache Iceberg] or [Delta
 Lake].
 
@@ -115,7 +116,7 @@ add a new Parquet file to your data lake you must also update the relevant
 external index to include information about the new file. Note, it **is**
 possible to avoid external indexes by only using information from the data
 files themselves, such as embed user-defined indexes directly in Parquet files,
-describe our previous blog [Embedding User-Defined Indexes in Apache Parquet
+described in our previous blog [Embedding User-Defined Indexes in Apache Parquet
 Files].
 
 Examples of information commonly stored in external indexes include:
@@ -157,7 +158,7 @@ limitations of a particular [implementation of the Parquet format] with the
 
 Apache Parquet's combination of good compression, high-performance, high quality
 open source libraries, and wide ecosystem interoperability make it a compelling
-choice when building new systems. While there are some niche use case that may
+choice when building new systems. While there are some niche use cases that may
 benefit from specialized formats, Parquet is typically the obvious choice.
 While recent proprietary file formats differ in details, they all use the same
 high level structure<sup>[2](#footnote2)</sup>:
@@ -169,7 +170,7 @@ The structure is so widespread because it enables the hierarchical pruning
 approach described in the next section. For example, the native [Clickhouse
 MergeTree] format consists of *Parts* (similar to Parquet files), and
 *Granules* (similar to Row Groups).
 The [Clickhouse indexing strategy] follows a classic
-heirarchal pruning approach that first locates the Parts and then the Granules
+hierarchical pruning approach that first locates the Parts and then the Granules
 that may contain relevant data for the query. This is exactly the same pattern
 as Parquet based systems, which first locate the relevant Parquet files and
 then the Row Groups / Data Pages within those files.
@@ -304,7 +305,7 @@ indexes, as described in the next sections.**
 
 # Pruning Files with External Indexes
 
-The first step in heirarchal pruning is quickly ruling out files that cannot
+The first step in hierarchical pruning is quickly ruling out files that cannot
 match the query. For example, if a system expects to have see queries that
 apply to a time range, it might create an external index to store the minimum
 and maximum `time` values for each file. Then, during query processing, the
@@ -420,7 +421,7 @@ DataFusion also includes several libraries to help with common filtering and
 pruning tasks, such as:
 
 * A full and well documented expression representation ([Expr]) and [APIs for
-  building, vistiting, and rewriting] query predicates
+  building, visiting, and rewriting] query predicates
 
 * Range Based Pruning ([PruningPredicate]) for cases where your index stores
   min/max values.
@@ -429,7 +430,7 @@ pruning tasks, such as:
 * Range analysis for predicates ([cp_solver]) for interval-based range analysis
   (e.g. `col > 5 AND col < 10`)
 
 [Expr]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html
-[APIs for building, vistiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
+[APIs for building, visiting, and rewriting]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#visiting-and-rewriting-exprs
 [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
 [ExprSimplifier]: https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/struct.ExprSimplifier.html#method.simplify
 [cp_solver]: https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html
@@ -437,14 +438,14 @@ pruning tasks, such as:
 
 # Pruning Parts of Parquet Files with External Indexes
 
 Once the set of files to be scanned has been determined, the next step in the
-heirarchal pruning process is to further narrow down the data within each file.
+hierarchical pruning process is to further narrow down the data within each file.
 Similarly to the previous step, almost all advanced query processing systems
 use additional metadata to prune unnecessary parts of the file, such as [Data
 Skipping Indexes in ClickHouse]. For Parquet-based systems, the most common
 strategy is using the built-in metadata such as [min/max statistics], and
 [Bloom Filters]). However, it is also possible to use external
-indexes for filtering *WITIHIN* Parquet files as shown below.
+indexes for filtering *WITHIN* Parquet files as shown below.
 
 [Data Skipping Indexes in ClickHouse]: https://clickhouse.com/docs/optimize/skipping-indexes
 [min/max statistics]: https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L267
@@ -753,7 +754,7 @@ and the system can quickly rule out directories that do not match the query pred
 
 <a id="footnote5"></a>`5`: I am also convinced that we can speed up the
 process of parsing Parquet footer with additional engineering effort (see
 [Xiangpeng Hao]'s [previous blog on the
-topic]). [Ed Seidl] is begining this effort. See the [ticket] for details.
+topic]). [Ed Seidl] is beginning this effort. See the [ticket] for details.
 
 <a id="footnote6"></a>`6`: ClickBench includes a wide variety of query patterns
 such as point lookups, filters of different selectivity, and aggregations.
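As background for the `PruningPredicate` references in the patch above, here is a minimal sketch of the range-based file pruning the post describes: an external index supplies per-file min/max `time` values, and DataFusion's `PruningPredicate` rules out files whose range cannot satisfy the query predicate. The `FileTimeIndex` type and the sample ranges are illustrative assumptions, not code from the post or this commit, and the exact `PruningStatistics` trait methods vary between DataFusion versions.

```rust
use std::collections::HashSet;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, BooleanArray, Int64Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::{Column, DFSchema, ScalarValue};
use datafusion::error::Result;
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};
use datafusion::prelude::{col, lit, SessionContext};

/// Hypothetical external index holding min/max `time` values for each file.
struct FileTimeIndex {
    mins: Int64Array,
    maxes: Int64Array,
}

impl PruningStatistics for FileTimeIndex {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "time").then(|| Arc::new(self.mins.clone()) as ArrayRef)
    }
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        (column.name == "time").then(|| Arc::new(self.maxes.clone()) as ArrayRef)
    }
    /// One pruning "container" per file in this sketch.
    fn num_containers(&self) -> usize {
        self.mins.len()
    }
    // This index stores no null counts, row counts, or exact-value sets.
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }
    fn contained(&self, _column: &Column, _values: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        None
    }
}

fn main() -> Result<()> {
    // Table schema: a single Int64 `time` column.
    let schema = Arc::new(Schema::new(vec![Field::new("time", DataType::Int64, false)]));

    // Query predicate `time > 100`, planned into a physical expression.
    let df_schema = DFSchema::try_from(schema.as_ref().clone())?;
    let predicate =
        SessionContext::new().create_physical_expr(col("time").gt(lit(100_i64)), &df_schema)?;

    // Three files whose `time` ranges are [0, 50], [40, 200], and [150, 300].
    let index = FileTimeIndex {
        mins: Int64Array::from(vec![0, 40, 150]),
        maxes: Int64Array::from(vec![50, 200, 300]),
    };

    // `false` means a file's range cannot match the predicate, so it is skipped.
    let keep = PruningPredicate::try_new(predicate, schema)?.prune(&index)?;
    assert_eq!(keep, vec![false, true, true]);
    Ok(())
}
```

The same pattern applies one level down the hierarchy: pointing the `PruningStatistics` containers at Row Groups or Data Pages instead of whole files yields the within-file pruning covered by the second patched section.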