Hi everyone,

Here's a proposal for native Vector Index support in Iceberg tables:
https://docs.google.com/document/d/1KL4qLOwdqnhOcqTc0EjO1O16NV3M3c-gZCEINDWw4lA/edit?usp=sharing

We've been working on this proposal with Peter internally at Microsoft and he suggested we post it here to bring this to the community's attention ahead of the next Secondary Index Sync.
Thanks,
Suhas

On 2026/02/19 04:34:34 huaxin gao wrote:
> Hi Everyone,
>
> Here are the recording and notes from the Iceberg Index Support Sync on
> 2/11.
>
> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>
> Notes:
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>
> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>
> Since the sync, I updated the Bloom skipping index proposal
> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
> to address the discussion questions, specifically:
>
>    - Performance justification: when this helps (high-cardinality = / IN,
>    many data files, high object-store latency) and how it differs from Parquet
>    row-group Bloom filters (which still require opening the data file).
>    - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>    file size), the planning cost trade-off (driver index reads vs executor
>    file opens), and mitigations via caching.
>    - Lifecycle / maintenance: incremental production as new data files
>    arrive, behavior when the index is missing/behind, and sharding/compaction
>    plus cleanup to avoid accumulating too many small Puffin files over time.
>    - Writer expectations: inline (optional) vs asynchronous (primary) index
>    creation.
>
> I also implemented a Spark 4.1 POC
> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
> quantify both the pruning impact (plannedFiles → afterBloom) and the index
> read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
> predicates on high-cardinality columns. Please take a look and let me know
> if you have any questions or feedback.
>
> Thanks,
>
> Huaxin
>
> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao wrote:
>
> > Reminder for tomorrow's sync on Iceberg Index Support.
> >
> > Wednesday: Feb. 11 9:00 – 10:00am
> > Time zone: America/Los_Angeles
> > Google Meet joining info
> > Video call link: meet.google.com/nsp-ctyr-khk
> > Design doc:
> >
> > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
> >
> > https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
> >
> > Thanks,
> > Huaxin
> >
> >
> > On Tue, Feb 3, 2026 at 10:52 PM Péter Váry wrote:
> >
> >> Thanks Huaxin and Steven for organizing this. Looking forward to meet you
> >> all next week!
> >>
> >> On Wed, Feb 4, 2026, 02:48 Steven Wu wrote:
> >>
> >>> We set up the dev calendar event with a new google meet link. Please
> >>> ignore the link from Huaxin's original email.
> >>>
> >>> The dev calendar has the correct info (including the new meeting link)
> >>>
> >>> Iceberg Index Support Sync
> >>> Wednesday, February 11 · 9:00 – 10:00am
> >>> Time zone: America/Los_Angeles
> >>> Google Meet joining info
> >>> Video call link: https://meet.google.com/nsp-ctyr-khk
> >>>
> >>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao wrote:
> >>>
> >>>> Sorry, I meant PST (not EST) :)
> >>>> Looking forward to the discussion!
> >>>>
> >>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang wrote:
> >>>>
> >>>>> Hi Huaxin,
> >>>>>
> >>>>> Thanks for starting the sync!
> >>>>>
> >>>>> The meeting seems to be 9-10AM PST on the dev events calendar
> >>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
> >>>>> not EST. Maybe it's a typo?
> >>>>> Otherwise, looking forward to the discussion!
> >>>>>
> >>>>> Best,
> >>>>> Shawn
> >>>>>
> >>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support.
> >>>>>> Here is the existing discussion thread:
> >>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
> >>>>>>
> >>>>>> To ground the discussion, here are the two proposals:
> >>>>>>
> >>>>>> - Peter's proposal
> >>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
> >>>>>> (overall index support)
> >>>>>> - My proposal
> >>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
> >>>>>> (bloom filter skipping index)
> >>>>>>
> >>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting
> >>>>>> next Wednesday (2/11). After FileFormat sync finishes, we plan to use
> >>>>>> that slot and switch to every other Monday, 9 AM to 10 AM EST.
> >>>>>>
> >>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Huaxin
