Hi Peter,

Thanks for the clarification. I will align the secondary index proposal accordingly.
Looking forward to the collaboration!

Best,
Huaxin

On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]> wrote:

> Cool!
> Happy to collaborate on this!
>
> > keep only minimal snapshot references in table metadata and move the
> > richer index definition and lifecycle into catalog-managed index metadata
> > exposed via the REST APIs.
>
> In my second iteration, I moved the snapshot references into the index
> metadata [1]. This allows the query engine to fetch indexes in parallel
> with the table metadata using *catalog.listIndexes*, where each returned
> *BaseIndex* already includes the available table snapshots.
> With that information, the engine can immediately determine whether a
> given index is applicable for the query by checking the index type, index
> columns, and the associated table snapshots.
> If the engine decides to use a particular index, it can then retrieve the
> corresponding *DetailedIndex*, which contains all additional details required
> by the engine.
> For Bloom filter indexes specifically, the *IndexSnapshots* could store
> the correct Puffin file path for each table snapshot in their snapshot
> properties.
>
> [1] - Iceberg indexes / Index Metadata / Snapshot -
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>
> On Mon, Jan 12, 2026 at 2:27, huaxin gao <[email protected]> wrote:
>
>> Hi Peter,
>>
>> Thanks a lot for sharing the proposal in [1] and for the detailed design.
>> The catalog-managed index framework there looks like a better long-term
>> direction than keeping full index definitions in table metadata.
>>
>> The current Bloom-filter draft describes indexes in table metadata so
>> planners can discover them during planning and map table snapshots to
>> Puffin files with Bloom filters, but that wiring can be changed easily to
>> the catalog-based model in [1]: keep only minimal snapshot references in
>> table metadata and move the richer index definition and lifecycle into
>> catalog-managed index metadata exposed via the REST APIs. In that model,
>> the Bloom-filter file-skipping index would be one concrete `IndexType`
>> whose data lives in Puffin files, with engines discovering and loading it
>> through the catalog (`listIndexes`, `loadIndex`, etc.).
>>
>> I agree that the Bloom-filter index would be an excellent candidate and a
>> very good fit as the first index type to implement in this framework, and
>> the proposal will be updated to follow the catalog-based approach.
>>
>> Best,
>>
>> Huaxin
>>
>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Hi Huaxin,
>>>
>>> This is a very interesting topic. We're also working on an index
>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>> iteration, I considered adding index metadata directly to the table
>>> metadata as well. After some back-and-forth, we ultimately moved to a
>>> different approach, where the catalog exposes an API to fetch the indexes
>>> for a given table.
>>>
>>> This has several advantages: for example, it avoids increasing the size
>>> of the table metadata and is more consistent with existing practices where
>>> UDFs, views, and materialized views each have their own specifications and
>>> metadata.
>>>
>>> After reading your proposal, I think the bloom filter index would be an
>>> excellent candidate and a very good fit as a first index type to implement,
>>> helping us evaluate the viability of the metadata approach.
>>>
>>> Please take a look and let me know what you think.
>>>
>>> Thanks,
>>> Peter
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>
>>> On Thu, Jan 8, 2026 at 17:27, huaxin gao <[email protected]> wrote:
>>>
>>>> Hi Iceberg community,
>>>>
>>>> I'd like to request feedback on a proposal
>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
>>>> scope.
>>>>
>>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>>> stored in Puffin and referenced from table metadata so query engines can
>>>> use them during planning to prune data files. Indexes are advisory-only and
>>>> snapshot-scoped. The proposal is fully backward compatible: engines that
>>>> don't understand the new metadata fields ignore them.
>>>>
>>>> I'd appreciate any feedback, questions, or concerns on the overall
>>>> direction and design.
>>>>
>>>> Best,
>>>>
>>>> Huaxin
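
For readers following the thread, the discovery flow Péter describes (list the indexes, check index type, index columns, and table snapshots, then resolve the Puffin file for a Bloom filter index) might look roughly like the sketch below. It is only an illustration under stated assumptions: the Python `BaseIndex` class, its field layout, the `"bloom-filter"` and `"btree"` type labels, the `"puffin-path"` property key, and the sample paths are all hypothetical and are not taken from either proposal or from any existing Iceberg API.

```python
from dataclasses import dataclass, field


@dataclass
class BaseIndex:
    """Hypothetical stand-in for the summary a listIndexes-style call could return."""
    name: str
    index_type: str          # hypothetical label, e.g. "bloom-filter"
    columns: tuple           # names of the indexed table columns
    # table snapshot id -> snapshot properties (assumed layout)
    snapshots: dict = field(default_factory=dict)


def applicable_indexes(indexes, index_type, filter_columns, snapshot_id):
    """Keep indexes of the wanted type that cover at least one filter column
    and were built for the table snapshot being queried."""
    wanted = set(filter_columns)
    return [
        idx for idx in indexes
        if idx.index_type == index_type
        and wanted & set(idx.columns)
        and snapshot_id in idx.snapshots
    ]


def puffin_path(index, snapshot_id):
    """Return the Puffin file recorded for a table snapshot, or None if absent."""
    return index.snapshots.get(snapshot_id, {}).get("puffin-path")


# Made-up result of listing the indexes of one table.
listed = [
    BaseIndex("orders_id_bloom", "bloom-filter", ("order_id",),
              {100: {"puffin-path": "s3://bucket/idx/orders_id-100.puffin"}}),
    BaseIndex("orders_ts_btree", "btree", ("order_ts",), {100: {}}),
    BaseIndex("orders_cust_bloom", "bloom-filter", ("customer_id",),
              {99: {"puffin-path": "s3://bucket/idx/orders_cust-99.puffin"}}),
]

# Query filters on order_id and customer_id against table snapshot 100.
usable = applicable_indexes(listed, "bloom-filter", ["order_id", "customer_id"], 100)
for idx in usable:
    print(idx.name, puffin_path(idx, 100))
```

Only the first index survives the check here: the second has a different type, and the third has no entry for snapshot 100. In the flow described above, the engine would fetch the detailed index metadata only for the indexes that pass this filter.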
