Thanks, Huaxin, for posting the recording and the meeting notes. I also used this time to address the questions collected during the sync:
- Collected some representative use cases. See the example use-cases <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d> paragraph. Anyone should feel free to suggest their own.
- Collected my thoughts about the writer requirements. See the writer requirements <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1> paragraph.
- Centralized the index maintenance related parts. See the index maintenance <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q> paragraph.

It might be a bit premature, but I created a PR <https://github.com/apache/iceberg/pull/15101> with the proposed index catalog related changes, so those who are more code oriented can take a look at it too.

huaxin gao <[email protected]> wrote (on Thu, Feb 19, 2026, 5:34):

> Hi Everyone,
>
> Here are the recording and notes from the Iceberg Index Support Sync on
> 2/11.
>
> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>
> Notes:
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>
> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>
> Since the sync, I updated the Bloom skipping index proposal
> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
> to address the discussion questions, specifically:
>
> - Performance justification: when this helps (high-cardinality = / IN,
>   many data files, high object-store latency) and how it differs from Parquet
>   row-group Bloom filters (which still require opening the data file).
> - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>   file size), the planning cost trade-off (driver index reads vs executor
>   file opens), and mitigations via caching.
> - Lifecycle / maintenance: incremental production as new data files
>   arrive, behavior when the index is missing/behind, and sharding/compaction
>   plus cleanup to avoid accumulating too many small Puffin files over time.
> - Writer expectations: inline (optional) vs asynchronous (primary)
>   index creation.
>
> I also implemented a Spark 4.1 POC
> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
> quantify both the pruning impact (plannedFiles → afterBloom) and the index
> read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
> predicates on high-cardinality columns. Please take a look and let me know
> if you have any questions or feedback.
>
> Thanks,
>
> Huaxin
>
> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>
>> Reminder for tomorrow's sync on Iceberg Index Support.
>>
>> Wednesday: Feb. 11 9:00 – 10:00am
>> Time zone: America/Los_Angeles
>> Google Meet joining info
>> Video call link: meet.google.com/nsp-ctyr-khk
>> Design doc:
>>
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>
>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>
>> Thanks,
>> Huaxin
>>
>>
>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
>> wrote:
>>
>>> Thanks Huaxin and Steven for organizing this. Looking forward to meet
>>> you all next week!
>>>
>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>
>>>> We set up the dev calendar event with a new google meet link. Please
>>>> ignore the link from Huaxin's original email.
>>>>
>>>> The dev calendar has the correct info (including the new meeting link)
>>>>
>>>> Iceberg Index Support Sync
>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>> Time zone: America/Los_Angeles
>>>> Google Meet joining info
>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>
>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>> wrote:
>>>>
>>>>> Sorry, I meant PST (not EST) :)
>>>>> Looking forward to the discussion!
>>>>>
>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Huaxin,
>>>>>>
>>>>>> Thanks for starting the sync!
>>>>>>
>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>> not EST. Maybe it's a typo?
>>>>>> Otherwise, looking forward to the discussion!
>>>>>>
>>>>>> Best,
>>>>>> Shawn
>>>>>>
>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support.
>>>>>>> Here is the existing discussion thread:
>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>
>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>
>>>>>>> - Peter's proposal
>>>>>>>   <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>   (overall index support)
>>>>>>> - My proposal
>>>>>>>   <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>   (bloom filter skipping index)
>>>>>>>
>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting
>>>>>>> next Wednesday (2/11). After FileFormat sync finishes, we plan to use
>>>>>>> that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>>
>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Huaxin
