Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

Péter Váry Mon, 12 Jan 2026 02:53:55 -0800

Cool!
Happy to collaborate on this!

> keep only minimal snapshot references in table metadata and move the
> richer index definition and lifecycle into catalog‑managed index metadata
> exposed via the REST APIs.


In my second iteration, I moved the snapshot references into the index
metadata [1]. This allows the query engine to fetch indexes in parallel
with the table metadata using *catalog.listIndexes*, where each returned
*BaseIndex* already includes the available table snapshots.
With that information, the engine can immediately determine whether a given
index is applicable for the query by checking the index type, index
columns, and the associated table snapshots.
If the engine decides to use a particular index, it can then retrieve the
corresponding DetailedIndex, which contains all additional details required
by the engine.
For Bloom filter indexes specifically, the *IndexSnapshots* could store the
correct Puffin file path for each table snapshot in their snapshot
properties.
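
A minimal sketch of that lookup flow, to make the two-step discovery concrete. It is illustrative only and is not the API from [1]: the class fields, the in-memory catalog, the "bloom-filter" type string, the "puffin-path" property key, and the example paths are all assumptions; only the names listIndexes, loadIndex, BaseIndex, DetailedIndex and IndexSnapshots come from this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class IndexSnapshot:
    # Hypothetical shape: one entry per table snapshot the index covers.
    table_snapshot_id: int
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class BaseIndex:
    # Lightweight summary, assumed to be what listIndexes returns.
    name: str
    index_type: str
    column_names: Tuple[str, ...]
    table_snapshot_ids: Tuple[int, ...]


@dataclass(frozen=True)
class DetailedIndex:
    # Full definition, fetched only for an index the engine decides to use.
    base: BaseIndex
    snapshots: Tuple[IndexSnapshot, ...]


class InMemoryCatalog:
    """Stand-in for a catalog; the table argument is ignored in this toy."""

    def __init__(self, indexes: List[DetailedIndex]):
        self._indexes = {d.base.name: d for d in indexes}

    def list_indexes(self, table: str) -> List[BaseIndex]:
        return [d.base for d in self._indexes.values()]

    def load_index(self, table: str, name: str) -> DetailedIndex:
        return self._indexes[name]


def find_puffin_path(catalog: InMemoryCatalog, table: str,
                     column: str, snapshot_id: int) -> Optional[str]:
    """Return the Puffin path of an applicable Bloom filter index, if any."""
    for base in catalog.list_indexes(table):
        # Step 1: decide applicability from the summary alone.
        applicable = (
            base.index_type == "bloom-filter"
            and column in base.column_names
            and snapshot_id in base.table_snapshot_ids
        )
        if not applicable:
            continue
        # Step 2: load the details only for the chosen index.
        detailed = catalog.load_index(table, base.name)
        for snap in detailed.snapshots:
            if snap.table_snapshot_id == snapshot_id:
                return snap.properties.get("puffin-path")
    return None


# Example data (made up for illustration).
base = BaseIndex("idx_user_id", "bloom-filter", ("user_id",), (100, 101))
detailed = DetailedIndex(base, (
    IndexSnapshot(100, {"puffin-path": "s3://bucket/idx/100.puffin"}),
    IndexSnapshot(101, {"puffin-path": "s3://bucket/idx/101.puffin"}),
))
catalog = InMemoryCatalog([detailed])
```

With this example data, find_puffin_path(catalog, "db.events", "user_id", 101) returns the path stored for snapshot 101, and a snapshot or column the index does not cover yields None, so the engine plans as if no index existed.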

[1] - Iceberg indexes / Index Metadata / Snapshot -
https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy

huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026,
2:27):

> Hi Peter,
>
>
> Thanks a lot for sharing the proposal in [1] and for the detailed design.
> The catalog‑managed index framework there looks like a better long‑term
> direction than keeping full index definitions in table metadata.
>
>
> The current Bloom‑filter draft describes indexes in table metadata so
> planners can discover them during planning and map table snapshots to
> Puffin files with Bloom filters, but that wiring can be changed easily to
> the catalog‑based model in [1]: keep only minimal snapshot references in
> table metadata and move the richer index definition and lifecycle into
> catalog‑managed index metadata exposed via the REST APIs. In that model,
> the Bloom‑filter file‑skipping index would be one concrete `IndexType`
> whose data lives in Puffin files, with engines discovering and loading it
> through the catalog (`listIndexes`, `loadIndex`, etc.).
>
>
> Agree that the Bloom‑filter index would be an excellent candidate and a
> very good fit as the first index type to implement in this framework, and
> the proposal will be updated to follow the catalog‑based approach.
>
>
> Best,
>
> Huaxin
>
>
>
>
>
> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
> wrote:
>
>> Hi Huaxin,
>>
>> This is a very interesting topic. We’re also working on an index proposal
>> [1] that aligns closely with yours in many areas. In an earlier iteration,
>> I considered adding index metadata directly to the table metadata as well.
>> After some back-and-forth, we ultimately moved to a different approach,
>> where the catalog exposes an API to fetch the indexes for a given table.
>>
>> This has several advantages—for example, it avoids increasing the size of
>> the table metadata and is more consistent with existing practices where
>> UDFs, views, and materialized views each have their own specifications and
>> metadata.
>>
>> After reading your proposal, I think the bloom filter index would be an
>> excellent candidate and a very good fit as a first index type to implement,
>> helping us evaluate the viability of the metadata approach.
>>
>> Please take a look and let me know what you think.
>> Thanks,
>> Peter
>>
>> [1] -
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>
>>
>> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026,
>> 17:27):
>>
>>> Hi Iceberg community,
>>>
>>> I’d like to request feedback on a proposal
>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
>>> scope.
>>>
>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>> stored in Puffin and referenced from table metadata so query engines can
>>> use them during planning to prune data files. Indexes are advisory-only and
>>> snapshot-scoped. The proposal is fully backward compatible: engines that
>>> don’t understand the new metadata fields ignore them.
>>>
>>> I’d appreciate any feedback, questions, or concerns on the overall
>>> direction and design.
>>>
>>> Best,
>>>
>>> Huaxin
>>>
>>
