Thanks Adam,

Indeed, embedding the index directly in the metadata is another common alternative and deserves mention in the post. I made a PR [1] to add that.
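To make the string-encoding cost Adam mentioned concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical index made of int64 distinct values and assumes base64 as the way to turn the binary form into a string that fits in a key/value metadata entry. Both choices, and the helper names, are illustrative only and are not what the blog post or G-Research actually use.

```python
import base64
import struct


def encode_index_for_kv_metadata(distinct_values):
    """Pack a hypothetical index (int64 values) and render it as a string.

    Returns both the raw binary form and the string form so the sizes
    can be compared. base64 is an assumed encoding, not a Parquet rule.
    """
    raw = struct.pack(f"<{len(distinct_values)}q", *distinct_values)
    text = base64.b64encode(raw).decode("ascii")
    return raw, text


def decode_index_from_kv_metadata(text):
    """Reverse the encoding above: string -> bytes -> list of int64."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


values = list(range(0, 3000, 3))  # 1000 made-up distinct values
raw, text = encode_index_for_kv_metadata(values)

# base64 emits 4 characters per 3 input bytes, so the string form is
# roughly a third larger than the binary form.
print(len(raw), len(text))  # 8000 10668
```

The extra decode step in decode_index_from_kv_metadata is also a small example of the additional deserialization work a string representation implies.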
Another tradeoff of putting the index information directly in the key/value metadata in the footer is that all Parquet readers must read (and ignore) all the index bytes when decoding the footer. Depending on the size of the index, this may exacerbate concerns about footer decode speed and bloat.

Clearly I don't know what is best for all systems, but we felt it was important to make the options available with Parquet clearer, to help people decide for themselves.

Andrew

[1] https://github.com/apache/datafusion-site/pull/90

On Wed, Jul 16, 2025 at 7:54 PM Adam Reeve <adre...@gmail.com> wrote:

> This was very interesting Andrew, thanks for sharing. We've done something
> quite similar at G-Research in the past but embedded the index directly in
> the key value metadata. That has the advantage of not needing an extra IO
> operation to read the index after you've read the footer, and it was simple
> to implement, but the index needs to be stored as a UTF-8 string so will
> usually be less compact than a binary representation and have more
> deserialization overhead.
>
> Cheers,
> Adam
>
>
> On Wed, 16 Jul 2025 at 23:16, wish maple <maplewish...@gmail.com> wrote:
>
> > Seems good. Personally I think
> >
> > 1. Parquet file format seems to have an index page [1], but I don't know
> > who's using it.
> > 2. Currently, Parquet only has single-column bloom filters and column
> > indexes. Maybe some kind of multi-column or other filter might work.
> > 3. Indexes can have different "levels": Page Index is designed for
> > "Page", and bloom filter / statistics for RowGroup. We can even define
> > an index for "file".
> >
> > Currently I don't know whether we can have some "official" sample index.
> > Personally I might be interested in some "sketches".
> >
> > Best,
> > Xuwei Fu
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655
> >
> > Andrew Lamb <andrewlam...@gmail.com> wrote on Wed, Jul 16, 2025 at 19:08:
> >
> > > I wrote a blog with Qi Zhu and Jigao Luo explaining how to embed user
> > > defined indexes into Parquet files without needing any changes to the
> > > format[1].
> > >
> > > I am sorry for the somewhat shameless self promotion, but I think this
> > > topic may be of general interest to the community in the context of other
> > > extensions to the format we have discussed recently. Techniques such as
> > > this widen potential use cases of Parquet without any need for consensus
> > > or timeline for ecosystem adoption.
> > >
> > > Andrew
> > >
> > > [1]:
> > > https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/