Re: Dedicated sync for Iceberg Index Support

Péter Váry Tue, 24 Feb 2026 03:52:41 -0800

We had an extended discussion on Slack with Dan, Steven, and Yufei about
where index metadata should live. In particular, whether it should be
stored directly in the table metadata or maintained in a dedicated index
catalog. I tried to capture this discussion in the Layout
<https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
section of the document.


Once the decision is made, this section can be shortened, but for now it is
intentionally more detailed so that everyone can see the arguments that
were discussed and so that those who could not participate synchronously
can still follow and provide feedback offline.

In short, we are currently *leaning toward storing index metadata in its
own catalog*, while allowing REST catalogs to expose a composite endpoint
that returns both table and index metadata in a single round trip. This is
similar in spirit to the universal load endpoint discussed in the context
of materialized view loading.
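
To make the shape of that option concrete, here is a minimal in-memory sketch in Python. It is not the proposed API: every name in it (TableCatalog, IndexCatalog, load_table_with_indexes, the metadata fields) is hypothetical and chosen only for illustration. It shows index metadata kept in its own catalog, with a composite call that returns table and index metadata together.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    name: str
    location: str


@dataclass
class IndexMetadata:
    name: str
    table: str       # name of the table this index belongs to
    index_type: str  # e.g. "bloom"; illustrative only


@dataclass
class TableCatalog:
    # Table metadata only; it knows nothing about indexes.
    tables: dict = field(default_factory=dict)

    def load_table(self, name):
        return self.tables[name]


@dataclass
class IndexCatalog:
    # Index metadata lives here, separate from the table metadata.
    indexes: list = field(default_factory=list)

    def list_indexes(self, table):
        return [i for i in self.indexes if i.table == table]


@dataclass
class CompositeLoadResult:
    table: TableMetadata
    indexes: list


def load_table_with_indexes(table_catalog, index_catalog, name):
    """Hypothetical composite endpoint: one call returns both pieces."""
    return CompositeLoadResult(
        table=table_catalog.load_table(name),
        indexes=index_catalog.list_indexes(name),
    )
```

In this sketch a client that only needs the table keeps calling load_table, while a planner that wants indexes makes a single composite call instead of two separate lookups.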

Thanks,
Peter

Péter Váry <[email protected]> wrote (Thu, Feb 19, 2026, 14:06):

> Thanks Huaxin for posting the recording and the meeting notes.
>
> I used this time to also address the questions collected during the sync:
>
>    - Collected some representative use cases. See the example use-cases
>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>    paragraph. Anyone should feel free to suggest their own.
>    - Collected my thoughts about the writer requirements. See the writer
>    requirements
>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>    paragraph.
>    - Centralized the index maintenance related parts. See the index
>    maintenance
>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>    paragraph.
>
> It might be a bit premature, but I created a PR
> <https://github.com/apache/iceberg/pull/15101> with the proposed index
> catalog related changes, so that those who are more code-oriented can take
> a look at it too.
>
> huaxin gao <[email protected]> wrote (Thu, Feb 19, 2026, 5:34):
>
>> Hi Everyone,
>>
>> Here are the recording and notes from the Iceberg Index Support Sync on
>> 2/11.
>>
>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>
>> Notes:
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>
>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>
>> Since the sync, I updated the Bloom skipping index proposal
>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>> to address the discussion questions, specifically:
>>
>>
>>    - Performance justification: when this helps (high-cardinality = /
>>    IN, many data files, high object-store latency) and how it differs from
>>    Parquet row-group Bloom filters (which still require opening the data
>>    file).
>>    - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>>    file size), the planning cost trade-off (driver index reads vs executor
>>    file opens), and mitigations via caching.
>>    - Lifecycle / maintenance: incremental production as new data files
>>    arrive, behavior when the index is missing/behind, and sharding/compaction
>>    plus cleanup to avoid accumulating too many small Puffin files over time.
>>    - Writer expectations: inline (optional) vs asynchronous (primary)
>>    index creation.
>>
>> I also implemented a Spark 4.1 POC
>> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
>> quantify both the pruning impact (plannedFiles → afterBloom) and the index
>> read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
>> predicates on high-cardinality columns. Please take a look and let me know
>> if you have any questions or feedback.
>>
>> Thanks,
>>
>> Huaxin
>>
>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]>
>> wrote:
>>
>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>
>>> Wednesday: Feb. 11 9:00 – 10:00am
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: meet.google.com/nsp-ctyr-khk
>>> Design doc:
>>>
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>
>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>
>>> Thanks,
>>> Huaxin
>>>
>>>
>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Thanks Huaxin and Steven for organizing this. Looking forward to meeting
>>>> you all next week!
>>>>
>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>
>>>>> We set up the dev calendar event with a new google meet link. Please
>>>>> ignore the link from Huaxin's original email.
>>>>>
>>>>> The dev calendar has the correct info (including the new meeting link)
>>>>>
>>>>> Iceberg Index Support Sync
>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>
>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Sorry, I meant PST (not EST) :)
>>>>>> Looking forward to the discussion!
>>>>>>
>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Huaxin,
>>>>>>>
>>>>>>> Thanks for starting the sync!
>>>>>>>
>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>> not EST. Maybe it's a typo?
>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>
>>>>>>> Best,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>> support. Here is the existing discussion thread:
>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>>
>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>
>>>>>>>>    - Peter's proposal
>>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>    (overall index support)
>>>>>>>>    - My proposal
>>>>>>>>    <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>    (bloom filter skipping index)
>>>>>>>>
>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting
>>>>>>>> next Wednesday (2/11). After FileFormat sync finishes, we plan to use
>>>>>>>> that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>>>
>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Huaxin
>>>>>>>>
>>>>>>>
