Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

huaxin gao Sun, 11 Jan 2026 17:28:28 -0800

Hi Peter,

Thanks a lot for sharing the proposal in [1] and for the detailed design.
The catalog‑managed index framework there looks like a better long‑term
direction than keeping full index definitions in table metadata.

The current Bloom‑filter draft describes indexes in table metadata so that
planners can discover them during planning and map table snapshots to the
Puffin files that hold the Bloom filters. That wiring can easily be changed
to the catalog‑based model in [1]: keep only minimal snapshot references in
table metadata, and move the richer index definition and lifecycle into
catalog‑managed index metadata exposed via the REST APIs. In that model,
the Bloom‑filter file‑skipping index would be one concrete `IndexType`
whose data lives in Puffin files, with engines discovering and loading it
through the catalog (`listIndexes`, `loadIndex`, etc.).
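
As a rough illustration of that flow, here is a minimal Python sketch. It
is not taken from either proposal: `InMemoryCatalog`, `BloomIndex`,
`list_indexes`, `load_index`, and `prune_files` are hypothetical stand-ins
for the catalog calls named above, and the Bloom filter is a toy rather
than the Puffin blob format. It only shows the intended behavior: the
planner asks the catalog for indexes, uses one that matches the column and
snapshot, and otherwise falls back to scanning every file.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Toy Bloom filter: num_hashes bit positions per value over num_bits bits."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive one bit position per seed from a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


@dataclass
class BloomIndex:
    """Hypothetical index: one filter per data file, for one column and snapshot."""
    name: str
    column: str
    snapshot_id: int
    filters: dict = field(default_factory=dict)  # data file path -> BloomFilter


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog that exposes list/load index calls."""

    def __init__(self):
        self._indexes = {}  # table name -> {index name -> BloomIndex}

    def register_index(self, table, index):
        self._indexes.setdefault(table, {})[index.name] = index

    def list_indexes(self, table):
        return sorted(self._indexes.get(table, {}))

    def load_index(self, table, name):
        return self._indexes[table][name]


def prune_files(catalog, table, snapshot_id, column, value, data_files):
    """Drop files whose filter proves that `column = value` cannot match."""
    for name in catalog.list_indexes(table):
        index = catalog.load_index(table, name)
        if index.column != column or index.snapshot_id != snapshot_id:
            continue  # index does not apply to this column or snapshot
        # Files without a filter are kept: the index is advisory only.
        return [f for f in data_files
                if f not in index.filters
                or index.filters[f].might_contain(value)]
    return list(data_files)  # no usable index: scan everything
```

Because the index is advisory and snapshot-scoped, a missing or stale
index never changes query results in this sketch; it only removes the
pruning benefit.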

I agree that the Bloom‑filter index is a very good fit as the first index
type to implement in this framework, and I will update the proposal to
follow the catalog‑based approach.

Best,

Huaxin

On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
wrote:

> Hi Huaxin,
>
> This is a very interesting topic. We’re also working on an index proposal
> [1] that aligns closely with yours in many areas. In an earlier iteration,
> I considered adding index metadata directly to the table metadata as well.
> After some back-and-forth, we ultimately moved to a different approach,
> where the catalog exposes an API to fetch the indexes for a given table.
>
> This has several advantages—for example, it avoids increasing the size of
> the table metadata and is more consistent with existing practices where
> UDFs, views, and materialized views each have their own specifications and
> metadata.
>
> After reading your proposal, I think the bloom filter index would be an
> excellent candidate and a very good fit as a first index type to implement,
> helping us evaluate the viability of the metadata approach.
>
> Please take a look and let me know what you think.
> Thanks,
> Peter
>
> [1] -
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>
>
> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026,
> 17:27):
>
>> Hi Iceberg community,
>>
>> I’d like to request feedback on a proposal
>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
>> scope.
>>
>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>> stored in Puffin and referenced from table metadata so query engines can
>> use them during planning to prune data files. Indexes are advisory-only and
>> snapshot-scoped. The proposal is fully backward compatible: engines that
>> don’t understand the new metadata fields ignore them.
>>
>> I’d appreciate any feedback, questions, or concerns on the overall
>> direction and design.
>>
>> Best,
>>
>> Huaxin
>>
>
