Re: [PR] [BLOG] Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet [datafusion-site]

via GitHub Mon, 11 Aug 2025 09:18:07 -0700


comphead commented on code in PR #99:
URL: https://github.com/apache/datafusion-site/pull/99#discussion_r2267311488



##########
content/blog/2025-08-15-external-parquet-indexes.md:
##########
@@ -0,0 +1,772 @@
+---
+layout: post
+title: Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet
+date: 2025-08-15
+author: Andrew Lamb (InfluxData)
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+It is a common misconception that [Apache Parquet] requires (slow) reparsing of
+metadata and is limited to indexing structures provided by the format. In fact,
+caching parsed metadata and using custom external indexes along with
+Parquet's hierarchical data organization can significantly speed up query
+processing.
+
+In this blog, I describe the role of external indexes, caches, and metadata
+stores in high performance systems, and demonstrate how to apply these concepts
+to Parquet processing using [Apache DataFusion]. *Note this is an expanded
+version of the [companion video] and [presentation].*
+
+# Motivation
+
+System designers choose between adopting a pre-configured data system and the
+often daunting task of building their own custom data platform from scratch.
+
+For many users and use cases, one of the existing data systems will
+likely be good enough. However, traditional systems such as [Apache Spark], [DuckDB],
+[ClickHouse], [Hive], and [Snowflake] are each optimized for a certain set of
+tradeoffs between performance, cost, availability, interoperability, deployment
+target, cloud / on-premises, operational ease and many other factors.
+
+For new, or especially demanding use cases, where no existing system makes your
+optimal tradeoffs, you can build your own custom data platform. Previously this
+was a long and expensive endeavor, but today, in the era of [Composable Data
+Systems], it is increasingly feasible. High quality, open source building blocks
+such as [Apache Parquet] for storage, [Apache Arrow] for in-memory processing,
+and [Apache DataFusion] for query execution make it possible to quickly build
+custom data platforms optimized for your specific
+needs<sup>[1](#footnote1)</sup>.
+
+
+
+[companion video]: https://www.youtube.com/watch?v=74YsJT1-Rdk
+[presentation]: https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q/edit
+
+[Apache Parquet]: https://parquet.apache.org/
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Arrow]: https://arrow.apache.org/
+[FDAP Stack]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
+[Composable Data Systems]: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
+[Apache Spark]: https://spark.apache.org/
+[ClickHouse]: https://clickhouse.com/
+[Hive]: https://hive.apache.org/
+[Snowflake]: https://www.snowflake.com/
+
+
+# Introduction to External Indexes / Catalogs / Metadata Stores / Caches
+
+<div class="text-center">
+<img
+  src="/blog/images/external-parquet-indexes/external-index-overview.png"
+  width="80%"
+  class="img-responsive"
+  alt="Using External Indexes to Accelerate Queries"
+/>
+</div>
+
+**Figure 1**: Using external indexes to speed up queries in an analytic system.
+Given a user's query (Step 1), the system uses an external index (one that is not
+stored inline in the data files) to quickly find files that may contain
+relevant data (Step 2). Then, for each file, the system uses the external index
+to further narrow the required data to only those **parts** of each file
+(e.g. data pages) that are relevant (Step 3). Finally, the system reads only those
+parts of the file and returns the results to the user (Step 4).
+
+In this blog, I use the term **"index"** to mean any structure that helps
+locate relevant data during processing. Figure 1 shows a high level overview of
+how external indexes are used to speed up queries.
+
+Data systems typically store both the data itself and additional information
+(metadata) to more quickly find data relevant to a query. Metadata is often
+stored in structures with names like "index," "catalog" and "cache" and the
+terminology varies widely across systems.
+
+There are many different types of indexes, types of content stored in indexes,
+strategies to keep indexes up to date, and ways to apply indexes during query
+processing. These differences each have their own set of tradeoffs, and thus
+different systems understandably make different choices depending on their use
+case. There is no one-size-fits-all solution for indexing. For example, Hive
+uses the [Hive Metastore], [Vertica] uses a purpose-built [Catalog] and open
+data lake systems typically use a table format like [Apache Iceberg] or [Delta
+Lake].
+
+**External Indexes** store information separately from ("external to") the data
+itself. External indexes are flexible and widely used, but require additional
+operational overhead to keep in sync with the data files. For example, if you
+add a new Parquet file to your data lake you must also update the relevant
+external index to include information about the new file. Note, it **is**
+possible to avoid external indexes by only using information from the data files
+themselves, for example by embedding user-defined indexes directly in Parquet files,
+as described in our previous blog [Embedding User-Defined Indexes in Apache Parquet
+Files].
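
As a rough illustration of the file-level pruning in Step 2 of Figure 1, and of
the bookkeeping needed to keep an external index in sync, here is a minimal
sketch in plain Python. It is not DataFusion's API: the `MinMaxIndex` class,
its methods, and the file names are all hypothetical, and it assumes a single
integer column with known min/max values per file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max values of one column in one data file (hypothetical layout)."""
    min_value: int
    max_value: int


class MinMaxIndex:
    """Toy external index: maps a file path to min/max stats for one column."""

    def __init__(self) -> None:
        self._stats: dict[str, FileStats] = {}

    def add_file(self, path: str, min_value: int, max_value: int) -> None:
        # Must be called whenever a new data file is written; otherwise the
        # index is out of sync and the new file is never considered.
        self._stats[path] = FileStats(min_value, max_value)

    def files_for_range(self, lo: int, hi: int) -> list[str]:
        # Keep a file only if its [min, max] overlaps the query range [lo, hi].
        # A kept file *may* contain matching rows; a pruned file cannot.
        return sorted(
            path
            for path, s in self._stats.items()
            if s.max_value >= lo and s.min_value <= hi
        )


index = MinMaxIndex()
index.add_file("data/part-0.parquet", 0, 99)
index.add_file("data/part-1.parquet", 100, 199)
index.add_file("data/part-2.parquet", 200, 299)

# Query predicate: x BETWEEN 150 AND 220
candidates = index.files_for_range(150, 220)
```

Here `candidates` holds only the two files whose ranges overlap the predicate,
so `data/part-0.parquet` is never opened. A real system would typically persist
the index and track more than one column.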
+
+Examples of information commonly stored in external indexes include:
+
+* Min/Max statistics
+* Bloom filters
+* Inverted indexes / Full Text indexes 
+* Information needed to read the remote file (e.g. the schema, or Parquet footer metadata)
+* Use case specific indexes

Review Comment:
   Does that mean use case specific indexes are stored inside external indexes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

