This looks good. Personally, I think:

1. The Parquet file format seems to have an index page [1], but I don't
   know who is using it.
2. Currently, Parquet only has single-column bloom filters and a column
   index. Maybe some kind of multi-column filter, or another kind of
   filter, might work.
3. Indexes can have different "levels": the Page Index is designed for
   pages, while bloom filters / statistics are for RowGroups. We could
   even define an index for the whole file.
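
To make point 2 a bit more concrete, here is a rough sketch of what a
multi-column filter could look like. This is only a toy under my own
assumptions: the class name MultiColumnBloom, its parameters, and the
hashing scheme (blake2b plus double hashing) are all hypothetical. It is
not Parquet's bloom filter layout and nothing like it exists in the
format today.

```python
import hashlib


class MultiColumnBloom:
    """Toy Bloom filter keyed on tuples of column values (hypothetical)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, values):
        # Encode the whole tuple as one key, then derive the bit positions
        # from two 64-bit halves of a single digest (double hashing).
        key = repr(tuple(values)).encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, values):
        for pos in self._positions(values):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, values):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(values)
        )


# One filter per row group, built over (country, user_id) pairs.
bloom = MultiColumnBloom()
for row in [("us", 1), ("de", 2), ("fr", 3)]:
    bloom.add(row)
print(bloom.might_contain(("de", 2)))
```

A reader could skip a row group whenever might_contain returns False for
a predicate like country = 'de' AND user_id = 2. The serialized bits
would still need to live somewhere in the file, which is where the
embedding technique from the blog post would come in.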

I don't know yet whether we could have some "official" sample index.
Personally, I might be interested in some "sketches".

Best,
Xuwei Fu

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655

Andrew Lamb <andrewlam...@gmail.com> wrote on Wed, Jul 16, 2025 at 19:08:

> I wrote a blog post with Qi Zhu and Jigao Luo explaining how to embed
> user-defined indexes into Parquet files without needing any changes to
> the format[1].
>
> I am sorry for the somewhat shameless self promotion, but I think this
> topic may be of general interest to the community in the context of other
> extensions to the format we have discussed recently. Techniques such as
> this widen the potential use cases of Parquet without any need for consensus
> or a timeline for ecosystem adoption.
>
> Andrew
>
> [1]:
> https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
>
