Hi Everyone,

Here are the recording and notes from the Iceberg Index Support Sync on 2/11.
Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
Notes: https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3

The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.

Since the sync, I updated the Bloom skipping index proposal
<https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
to address the discussion questions, specifically:

- Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
- Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
- Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
- Writer expectations: inline (optional) vs asynchronous (primary) index creation.

I also implemented a Spark 4.1 POC <https://github.com/apache/iceberg/pull/15311> and a local benchmark to quantify both the pruning impact (plannedFiles → afterBloom) and the index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point predicates on high-cardinality columns.

Please take a look and let me know if you have any questions or feedback.

Thanks,
Huaxin

On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:

> Reminder for tomorrow's sync on Iceberg Index Support.
>
> Wednesday: Feb. 11 9:00 – 10:00am
> Time zone: America/Los_Angeles
> Google Meet joining info
> Video call link: meet.google.com/nsp-ctyr-khk
> Design doc:
>
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>
> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>
> Thanks,
> Huaxin
>
>
> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
> wrote:
>
>> Thanks Huaxin and Steven for organizing this. Looking forward to meet you
>> all next week!
>>
>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>
>>> We set up the dev calendar event with a new google meet link. Please
>>> ignore the link from Huaxin's original email.
>>>
>>> The dev calendar has the correct info (including the new meeting link)
>>>
>>> Iceberg Index Support Sync
>>> Wednesday, February 11 · 9:00 – 10:00am
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>
>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Sorry, I meant PST (not EST) :)
>>>> Looking forward to the discussion!
>>>>
>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Huaxin,
>>>>>
>>>>> Thanks for starting the sync!
>>>>>
>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>> not EST. Maybe it's a typo?
>>>>> Otherwise, looking forward to the discussion!
>>>>>
>>>>> Best,
>>>>> Shawn
>>>>>
>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support.
>>>>>> Here is the existing discussion thread:
>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>
>>>>>> To ground the discussion, here are the two proposals:
>>>>>>
>>>>>> - Peter's proposal
>>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>> (overall index support)
>>>>>> - My proposal
>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>> (bloom filter skipping index)
>>>>>>
>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting
>>>>>> next Wednesday (2/11). After FileFormat sync finishes, we plan to use
>>>>>> that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>
>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>
>>>>>> Thanks,
>>>>>> Huaxin
>>>>>>
>>>>>
