Here is a neat demo showing how to use this technique to embed a Tantivy full-text index as a user-defined index in Parquet files:
https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy

Quoting from the repo:

Summary of findings:

* The full-text index provides a 2X to 80X speedup for queries that return zero or very few matching rows.
* When queries return more rows, a full table scan is better. The crossover point is ~0.04% of the total rows.
* The Parquet file size increases by 80% when we index a single text column with an average length of 42 (3 - 13 words).
* The geometric mean of speedup across all 36 benchmark queries is 1.90X.
* For instructions on how to run the demo, see the How to Run section below.

# Embedding a Tantivy Index In Parquet

"Parquet tolerates unknown bytes within the file body and permits arbitrary key/value pairs in its footer metadata. These two features enable embedding user-defined indexes directly in the file—no extra files, no format forks, and no compatibility breakage."

From the DataFusion blog post: Embedding User-Defined Indexes in Apache Parquet Files

This demo extends a Parquet file by embedding a Tantivy full-text search index inside it. A custom DataFusion TableProvider implementation uses the embedded full-text index to optimize wildcard LIKE predicates. For example:

SELECT id, title FROM t WHERE title LIKE '%dairy cow%'

On Fri, Jul 18, 2025 at 1:21 AM Gang Wu <ust...@gmail.com> wrote:

> An orthogonal discussion on the Iceberg dev ML:
> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
>
> It proposes to add index support to Iceberg. One use case is to leverage
> Parquet files to store inverted indexes.
>
> Best,
> Gang
>
>
> On Fri, Jul 18, 2025 at 1:32 AM Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>
> > On Thu, Jul 17, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > > > 1. Parquet file format seems have index page [1], but I don't know who's
> > >
> > > The INDEX_PAGE type a fascinating point -- I am not sure what benefit
> > > writing indexes using that annotation would be 🤔
> > >
> > > > Currently I don't know whether we can have some "offcial" sample index.
> > >
> > > I am not sure examples need to be "official" -- I suspect people would be
> > > interested in public open source examples of various types of indexes that
> > > they could adapt to their own needs.
> > >
> >
> > Having at least a central location with namespaced index names and links to
> > the description of their implementation would help more libraries leverage
> > these indexes. Because the lack of standardization means only the libraries
> > that produced the files can leverage them. Which defeats the purpose of
> > using an open format somewhat.
> >
> > --
> > Felipe
> >
> >
> > > Andrew
> > >
> > > On Wed, Jul 16, 2025 at 7:16 AM wish maple <maplewish...@gmail.com> wrote:
> > >
> > > > Seems good. Personally I think
> > > >
> > > > 1. Parquet file format seems have index page [1], but I don't know who's
> > > > using it.
> > > > 2. Currently, Parquet only have single column bloom filter and column
> > > > index. Maybe
> > > > some kind of multi-column or other filter might work
> > > > 3. Index can have different "levels", like Page Index is designed for
> > > > "Page", and bloom
> > > > filter / statistics for RowGroup. We can even define index for "file"
> > > >
> > > > Currently I don't know whether we can have some "offcial" sample index.
> > > > Personally I
> > > > might be interested in some "sketches"
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655
> > > >
> > > > Andrew Lamb <andrewlam...@gmail.com> wrote on Wed, Jul 16, 2025 at 19:08:
> > > >
> > > > > I wrote a blog with Qi Zhu, Jigao Luo explaining how to embed user defined
> > > > > indexes into Parquet files without needing any changes to the format[1].
> > > > >
> > > > > I am sorry for the somewhat shameless self promotion, but I think this
> > > > > topic may be of general interest to the community in the context of other
> > > > > extensions to the format we have discussed recently. Techniques such as
> > > > > this widen potential usecases of Parquet without any need for consensus or
> > > > > timeline for ecosystem adoption.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]:
> > > > > https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
> > > > >