Hi team,

I’ve put together a proof‑of‑concept implementation for secondary index metadata, along with a related in‑memory catalog: https://github.com/apache/iceberg/pull/15101
This should be useful for exercising and validating specific index implementations as part of the overall proposal. Feel free to take a look and share feedback, keeping in mind that this is a PoC and not intended for production use.

Thanks,
Peter

Péter Váry <[email protected]> wrote on Tue, Jan 13, 2026, 14:34:

> Hi Vaibhav,
>
> We currently have the generic Index proposal, which outlines how Iceberg
> indexes can be stored and accessed by query engines. It defines the
> structure and handling of index metadata. To properly validate the design,
> we are proposing to implement a few concrete index types. This will help us
> identify gaps and refine the overall approach.
>
> In the documents, we outlined four initial index types:
>
> - Bloom filter index – covered in Huaxin’s document
> - B‑Tree index – backed by a materialized view
> - Full‑text index – backed by a materialized view
> - IVF index – backed by a materialized view
>
> The advantage of these index types is that many of their underlying
> components already exist, or are covered by other ongoing proposals. This
> means we can implement them, and even use them, with relatively low effort.
>
> Additional index types can be introduced later by the community. Once the
> index metadata model is in place, adding new index implementations becomes
> straightforward.
>
> We don’t yet have exact timelines for the roadmap. Our first step is to
> build community consensus around the proposal; implementation can begin
> once we have alignment.
>
> I hope this clarifies things. If you have any further questions, please
> let me know.
>
> Thanks,
> Peter
>
> Vaibhav Kumar <[email protected]> wrote on Tue, Jan 13, 2026, 12:23:
>
>> Hi Peter/Huaxin,
>>
>> This is a very interesting topic. Thank you for sharing all the
>> documentation.
>> I have a few questions I hope you can clarify:
>>
>> Does this mean that the three types of indexes (B-Tree, Full-Text, and
>> IVF) can all be addressed through the use of materialized views? Or are
>> there scenarios where dedicated index structures are still necessary?
>> Doc referred:
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0>
>>
>> I’m also interested in the current roadmap for secondary indexes. Are
>> there any concrete plans or timelines for their introduction in upcoming
>> releases? Additionally, is there a draft or active pull request for this
>> feature? I am happy to collaborate on this topic.
>>
>> Thank you in advance for your insights!
>>
>> Regards,
>> Vaibhav
>>
>> On Tue, Jan 13, 2026 at 6:43 AM huaxin gao <[email protected]>
>> wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks for the clarification. I will align the secondary index proposal
>>> accordingly.
>>>
>>> Looking forward to the collaboration!
>>>
>>> Best,
>>> Huaxin
>>>
>>> On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Cool!
>>>> Happy to collaborate on this!
>>>>
>>>> > keep only minimal snapshot references in table metadata and move the
>>>> > richer index definition and lifecycle into catalog‑managed index metadata
>>>> > exposed via the REST APIs.
>>>>
>>>> In my second iteration, I moved the snapshot references into the index
>>>> metadata [1]. This allows the query engine to fetch indexes in parallel
>>>> with the table metadata using *catalog.listIndexes*, where each
>>>> returned *BaseIndex* already includes the available table snapshots.
>>>> With that information, the engine can immediately determine whether a
>>>> given index is applicable for the query by checking the index type, index
>>>> columns, and the associated table snapshots.
>>>> If the engine decides to use a particular index, it can then retrieve
>>>> the corresponding DetailedIndex, which contains all additional details
>>>> required by the engine.
>>>> For Bloom filter indexes specifically, the *IndexSnapshots* could
>>>> store the correct Puffin file path for each table snapshot in their
>>>> snapshot properties.
>>>>
>>>> [1] - Iceberg indexes / Index Metadata / Snapshot -
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>>>>
>>>> huaxin gao <[email protected]> wrote on Mon, Jan 12, 2026, 2:27:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>> Thanks a lot for sharing the proposal in [1] and for the detailed
>>>>> design. The catalog‑managed index framework there looks like a better
>>>>> long‑term direction than keeping full index definitions in table metadata.
>>>>>
>>>>> The current Bloom‑filter draft describes indexes in table metadata so
>>>>> planners can discover them during planning and map table snapshots to
>>>>> Puffin files with Bloom filters, but that wiring can be changed easily to
>>>>> the catalog‑based model in [1]: keep only minimal snapshot references in
>>>>> table metadata and move the richer index definition and lifecycle into
>>>>> catalog‑managed index metadata exposed via the REST APIs. In that model,
>>>>> the Bloom‑filter file‑skipping index would be one concrete `IndexType`
>>>>> whose data lives in Puffin files, with engines discovering and loading it
>>>>> through the catalog (`listIndexes`, `loadIndex`, etc.).
>>>>>
>>>>> I agree that the Bloom‑filter index would be an excellent candidate and
>>>>> a very good fit as the first index type to implement in this framework,
>>>>> and the proposal will be updated to follow the catalog‑based approach.
>>>>>
>>>>> Best,
>>>>> Huaxin
>>>>>
>>>>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]> wrote:
>>>>>
>>>>>> Hi Huaxin,
>>>>>>
>>>>>> This is a very interesting topic. We’re also working on an index
>>>>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>>>>> iteration, I considered adding index metadata directly to the table
>>>>>> metadata as well. After some back-and-forth, we ultimately moved to a
>>>>>> different approach, where the catalog exposes an API to fetch the indexes
>>>>>> for a given table.
>>>>>>
>>>>>> This has several advantages. For example, it avoids increasing the
>>>>>> size of the table metadata and is more consistent with existing practices
>>>>>> where UDFs, views, and materialized views each have their own
>>>>>> specifications and metadata.
>>>>>>
>>>>>> After reading your proposal, I think the bloom filter index would be
>>>>>> an excellent candidate and a very good fit as a first index type to
>>>>>> implement, helping us evaluate the viability of the metadata approach.
>>>>>>
>>>>>> Please take a look and let me know what you think.
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> [1] -
>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>>>>
>>>>>> huaxin gao <[email protected]> wrote on Thu, Jan 8, 2026, 17:27:
>>>>>>
>>>>>>> Hi Iceberg community,
>>>>>>>
>>>>>>> I’d like to request feedback on a proposal
>>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>>>>> to introduce secondary indexes to Apache Iceberg with a narrow,
>>>>>>> incremental scope.
>>>>>>>
>>>>>>> Phase 1 adds file-skipping indexes based on per-column Bloom
>>>>>>> filters, stored in Puffin and referenced from table metadata so query
>>>>>>> engines can use them during planning to prune data files. Indexes are
>>>>>>> advisory-only and snapshot-scoped.
>>>>>>> The proposal is fully backward compatible: engines that don’t
>>>>>>> understand the new metadata fields ignore them.
>>>>>>>
>>>>>>> I’d appreciate any feedback, questions, or concerns on the overall
>>>>>>> direction and design.
>>>>>>>
>>>>>>> Best,
>>>>>>> Huaxin
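
The lookup flow discussed in this thread (list the indexes for a table, filter by index type, index columns, and table snapshot, then load the detailed index and read the Puffin file path from its snapshot properties) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the API from the PoC pull request: the `IndexSummary`, `IndexDetails`, and `IndexCatalog` types, the `"bloom-filter"` type string, the `"puffin-path"` property key, and the table, column, and path values are all hypothetical stand-ins for the `BaseIndex`, `DetailedIndex`, and `listIndexes`/`loadIndex` concepts named above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class IndexSelectionSketch {
  // Hypothetical lightweight index entry, standing in for the thread's "BaseIndex":
  // enough information to decide applicability without a second catalog call.
  record IndexSummary(String name, String type, List<String> columns, Set<Long> tableSnapshotIds) {}

  // Hypothetical full index metadata, standing in for the thread's "DetailedIndex".
  // Maps each table snapshot id to that index snapshot's properties.
  record IndexDetails(IndexSummary summary, Map<Long, Map<String, String>> snapshotProperties) {}

  // Hypothetical catalog surface; the real method signatures may differ.
  interface IndexCatalog {
    List<IndexSummary> listIndexes(String tableName);
    IndexDetails loadIndex(String tableName, String indexName);
  }

  // Step 1: pick an index whose type, columns, and covered snapshots match the query.
  static Optional<IndexSummary> pickIndex(
      List<IndexSummary> candidates, String type, String column, long snapshotId) {
    return candidates.stream()
        .filter(index -> index.type().equals(type))
        .filter(index -> index.columns().contains(column))
        .filter(index -> index.tableSnapshotIds().contains(snapshotId))
        .findFirst();
  }

  // Step 2: only when an index is applicable, load its details and look up the
  // Puffin file path recorded for the table snapshot (assumed property key).
  static Optional<String> puffinPath(
      IndexCatalog catalog, String table, String column, long snapshotId) {
    return pickIndex(catalog.listIndexes(table), "bloom-filter", column, snapshotId)
        .map(index -> catalog.loadIndex(table, index.name()))
        .map(details -> details.snapshotProperties().get(snapshotId))
        .map(properties -> properties.get("puffin-path"));
  }

  // Tiny in-memory catalog with a single Bloom filter index, for demonstration only.
  static IndexCatalog demoCatalog() {
    IndexSummary bloom = new IndexSummary(
        "orders_id_bloom", "bloom-filter", List.of("order_id"), Set.of(100L, 101L));
    IndexDetails details = new IndexDetails(bloom, Map.of(
        100L, Map.of("puffin-path", "s3://bucket/index/snap-100.puffin"),
        101L, Map.of("puffin-path", "s3://bucket/index/snap-101.puffin")));
    return new IndexCatalog() {
      @Override
      public List<IndexSummary> listIndexes(String tableName) {
        return List.of(bloom);
      }

      @Override
      public IndexDetails loadIndex(String tableName, String indexName) {
        return details;
      }
    };
  }

  public static void main(String[] args) {
    IndexCatalog catalog = demoCatalog();
    // Snapshot 101 is covered by the index, so a path is found.
    System.out.println(puffinPath(catalog, "db.orders", "order_id", 101L));
    // Snapshot 999 is not covered, so the engine falls back to a plain scan.
    System.out.println(puffinPath(catalog, "db.orders", "order_id", 999L));
  }
}
```

Returning an empty `Optional` when no index matches mirrors the advisory-only nature of the indexes described in the original proposal: an engine that finds nothing applicable simply plans the query without the index.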
