Re: Embedding User-Defined Indexes in Apache Parquet Files

Andrew Lamb Thu, 25 Sep 2025 08:06:07 -0700

Here is a neat demo showing how to use this technique to embed a Tantivy
full-text index as a user-defined index in Parquet files:


https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy

Quoting from the repo:

Summary of findings:
* The full-text index provides a 2X to 80X speedup for queries that return
zero or very few matching rows.
* When queries return more rows, a full table scan is better. The crossover
point is ~0.04% of the total rows.
* The Parquet file size increases by 80% when we index a single text column
with an average length of 42 (3 - 13 words).
* The geometric mean of speedup across all 36 benchmark queries is 1.90X.
* For instructions on how to run the demo, see the How to Run section below.
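The crossover finding can be read as a simple planning rule. The following is a minimal sketch of that rule only; `choose_plan` and the idea of having a match-count estimate available before execution are hypothetical, and nothing quoted here says the demo makes its decision this way.

```python
# Hypothetical planning rule built from the reported crossover point.
# ~0.04% of total rows is taken from the findings above; everything else
# (function name, the availability of an estimate) is an assumption.
CROSSOVER_FRACTION = 0.0004


def choose_plan(estimated_matches: int, total_rows: int,
                crossover: float = CROSSOVER_FRACTION) -> str:
    """Return "index" when the expected match fraction is below the crossover."""
    if total_rows <= 0:
        return "scan"
    fraction = estimated_matches / total_rows
    return "index" if fraction < crossover else "scan"


if __name__ == "__main__":
    print(choose_plan(100, 1_000_000))   # 0.01% of rows
    print(choose_plan(1000, 1_000_000))  # 0.1% of rows
```

The threshold is specific to the benchmark data and queries, so it would presumably need to be re-measured for other workloads.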

# Embedding a Tantivy Index In Parquet

"Parquet tolerates unknown bytes within the file body and permits arbitrary
key/value pairs in its footer metadata. These two features enable embedding
user-defined indexes directly in the file—no extra files, no format forks,
and no compatibility breakage."

From the DataFusion blog post: Embedding User-Defined Indexes in Apache
Parquet Files
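To make the two features in the quote concrete, here is a toy model of the layout idea, not real Parquet: the footer is JSON instead of Thrift, the magic bytes are made up, and the key names `my_index_offset` and `my_index_length` are hypothetical. It only shows that a reader which ignores the extra bytes still finds its data, while an index-aware reader locates the index through the footer's key/value pairs.

```python
import json
import struct

MAGIC = b"TOY1"  # made-up magic; real Parquet uses its own magic and a Thrift footer


def write_file(data: bytes, index: bytes) -> bytes:
    """Layout: magic | data | index bytes | footer (JSON) | footer length | magic."""
    body = MAGIC + data
    index_offset = len(body)          # the index starts right after the data
    body += index
    footer = json.dumps({
        "data_offset": len(MAGIC),
        "data_length": len(data),
        # Arbitrary string key/value pairs; the key names are hypothetical.
        "key_value_metadata": {
            "my_index_offset": str(index_offset),
            "my_index_length": str(len(index)),
        },
    }).encode()
    return body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(buf: bytes) -> dict:
    """Find the footer from the end of the file, as a Parquet reader would."""
    assert buf[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return json.loads(buf[-8 - footer_len:-8])


def read_data(buf: bytes) -> bytes:
    """Index-unaware reader: follows footer offsets and never sees the index."""
    footer = read_footer(buf)
    start = footer["data_offset"]
    return buf[start:start + footer["data_length"]]


def read_index(buf: bytes) -> bytes:
    """Index-aware reader: looks up the index location in the key/value pairs."""
    kv = read_footer(buf)["key_value_metadata"]
    start, length = int(kv["my_index_offset"]), int(kv["my_index_length"])
    return buf[start:start + length]


if __name__ == "__main__":
    buf = write_file(b"column data", b"serialized index")
    print(read_data(buf), read_index(buf))
```

Because readers reach the data through offsets recorded in the footer, bytes that no offset points at are simply skipped, which is why the extra index does not break existing readers.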

This demo extends a Parquet file by embedding a Tantivy full-text search
index inside it. A custom DataFusion TableProvider implementation uses the
embedded full-text index to optimize wildcard LIKE predicates.

For example:

SELECT id,
       title
FROM t
WHERE title LIKE '%dairy cow%'
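As a rough illustration of why an index helps most when few rows match, here is a toy inverted index in plain Python. It is not Tantivy and not the demo's code: all function names are hypothetical, and it assumes the pattern consists of whole words, so a row where the pattern matches in the middle of a word (for example "dairy cows") would be missed. How the demo handles that case is not described here.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles: list[str]) -> dict[str, set[int]]:
    """Map each token to the set of row ids whose title contains it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, title in enumerate(titles):
        for token in tokenize(title):
            index[token].add(row_id)
    return index


def like_contains(titles: list[str], index: dict[str, set[int]],
                  phrase: str) -> list[int]:
    """Evaluate title LIKE '%phrase%' assuming phrase is made of whole words."""
    tokens = tokenize(phrase)
    if not tokens:
        return list(range(len(titles)))  # nothing to look up: check no rows out
    # Candidate rows must contain every token of the phrase.
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    # Verify only the candidates with the exact (case-sensitive) substring test.
    return sorted(r for r in candidates if phrase in titles[r])


if __name__ == "__main__":
    titles = ["Dairy cow breeds", "the dairy cow", "cow and dairy", "goat cheese"]
    index = build_index(titles)
    print(like_contains(titles, index, "dairy cow"))
```

The substring test runs only on the candidate rows, so a query with no or few candidates touches almost none of the table, while a query matching many rows does the index lookup on top of nearly a full scan.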





On Fri, Jul 18, 2025 at 1:21 AM Gang Wu <ust...@gmail.com> wrote:

> An orthogonal discussion on the Iceberg dev ML:
> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
>
> It proposes to add index support to Iceberg. One use case is to leverage
> Parquet files to store inverted indexes.
>
> Best,
> Gang
>
>
> On Fri, Jul 18, 2025 at 1:32 AM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > On Thu, Jul 17, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com>
> > wrote:
> >
> > > > 1. Parquet file format seems to have an index page [1], but I don't know who's
> > >
> > > The INDEX_PAGE type is a fascinating point -- I am not sure what benefit
> > > writing indexes using that annotation would bring 🤔
> > >
> > > > Currently I don't know whether we can have some "official" sample index.
> > >
> > > I am not sure examples need to be "official" -- I suspect people would be
> > > interested in public open source examples of various types of indexes that
> > > they could adapt to their own needs.
> > >
> >
> > Having at least a central location with namespaced index names and links to
> > the descriptions of their implementations would help more libraries leverage
> > these indexes. Without standardization, only the libraries that produced the
> > files can leverage them, which somewhat defeats the purpose of using an open
> > format.
> >
> > --
> > Felipe
> >
> >
> > > Andrew
> > >
> > > On Wed, Jul 16, 2025 at 7:16 AM wish maple <maplewish...@gmail.com> wrote:
> > >
> > > > Seems good. Personally I think:
> > > >
> > > > 1. The Parquet file format seems to have an index page [1], but I don't
> > > >     know who's using it.
> > > > 2. Currently, Parquet only has single-column bloom filters and the column
> > > >     index. Maybe some kind of multi-column or other filter might work.
> > > > 3. Indexes can have different "levels": the Page Index is designed for
> > > >     "Page", and bloom filters / statistics for RowGroup. We can even
> > > >     define an index for "file".
> > > >
> > > > Currently I don't know whether we can have some "official" sample index.
> > > > Personally I might be interested in some "sketches".
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655
> > > >
> > > > Andrew Lamb <andrewlam...@gmail.com> wrote on Wed, Jul 16, 2025 at 19:08:
> > > >
> > > > > I wrote a blog post with Qi Zhu and Jigao Luo explaining how to embed
> > > > > user-defined indexes into Parquet files without needing any changes to
> > > > > the format [1].
> > > > >
> > > > > I am sorry for the somewhat shameless self-promotion, but I think this
> > > > > topic may be of general interest to the community in the context of
> > > > > other extensions to the format we have discussed recently. Techniques
> > > > > such as this widen the potential use cases of Parquet without any need
> > > > > for consensus or a timeline for ecosystem adoption.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]:
> > > > > https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
> > > > >
> > > >
> > >
> >
>
