Re: Dedicated sync for Iceberg Index Support

Péter Váry Sat, 28 Feb 2026 03:53:34 -0800

Please note that the next *Secondary Index Sync* will take place on *March
2nd, 9:00-10:00 AM PT*.


*Proposed agenda*:

   - Discussion of potential use‑cases
      - Primary Key index for Flink equality‑delete resolution
      - Secondary data layout
         - Containing index
         - Alternative query plans
      - Vector index
   - Discussion of the two alternative approaches for metadata placement:
   keeping index metadata inside the table metadata vs. managing it externally
   through an Index Catalog
   - Bloom filter index status update
      - Performance justification: when this helps (high-cardinality = /
      IN, many data files, high object-store latency) and how it differs from
      Parquet row-group Bloom filters (which still require opening the
      data file).
      - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
      file size), the planning cost trade-off (driver index reads vs executor
      file opens), and mitigations via caching.
      - Lifecycle / maintenance: incremental production as new data files
      arrive, behavior when the index is missing/behind, and
      sharding/compaction plus cleanup to avoid accumulating too many small
      Puffin files over time.
      - Writer expectations: inline (optional) vs asynchronous (primary)
      index creation.
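
To make the sizing and pruning items above a bit more concrete, here is a
minimal sketch of a per-data-file Bloom filter used at planning time. It is
not taken from the proposal or the POC: the names (bloom_size, BloomFilter,
prune_files), the SHA-256 double hashing, the 1% false-positive target, and
the choice to keep any file that has no index entry are all assumptions made
for illustration only.

```python
import hashlib
import math


def bloom_size(n_values: int, fpp: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a classic Bloom filter.

    Textbook formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    bits = math.ceil(-n_values * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_values * math.log(2)))
    return bits, hashes


class BloomFilter:
    """Hypothetical in-memory Bloom filter for one data file."""

    def __init__(self, n_values: int, fpp: float = 0.01):
        self.bits, self.hashes = bloom_size(n_values, fpp)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, value: str) -> list[int]:
        # Double hashing: derive all positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(
            self.array[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def prune_files(
    planned_files: list[str],
    file_blooms: dict[str, BloomFilter],
    value: str,
) -> list[str]:
    """Keep files that may contain `value` for an equality predicate.

    Assumption: a file with no Bloom entry (index missing or behind) is kept,
    so a stale index can only cost extra file opens, never drop rows.
    """
    kept = []
    for path in planned_files:
        bloom = file_blooms.get(path)
        if bloom is None or bloom.might_contain(value):
            kept.append(path)
    return kept
```

With these textbook formulas, 1,000,000 distinct values at a 1% false-positive
rate need roughly 9.6 million bits, or about 1.2 MB per data file, with 7 hash
functions. A Bloom filter never produces false negatives, so pruning can only
keep extra files; it cannot skip a file that actually holds the value.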

Looking forward to diving into this topic together.

See you all there,
Peter

Péter Váry <[email protected]> wrote (on Wed, Feb 25, 2026, 10:04):

> Dan kindly set up a dedicated public Slack channel (*#indexes*) for the
> Secondary Index discussion.
> You can find it here:
> https://apache-iceberg.slack.com/archives/C0AFDSU3EUU
> Feel free to join if you’d like to participate in the discussion or simply
> follow along.
>
> Thanks,
> Peter
>
> Péter Váry <[email protected]> wrote (on Tue, Feb 24, 2026, 12:52):
>
>> We had an extended discussion on Slack with Dan, Steven, and Yufei about
>> where index metadata should live. In particular, whether it should be
>> stored directly in the table metadata or maintained in a dedicated index
>> catalog. I tried to capture this discussion in the Layout
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
>> section of the document.
>>
>> Once the decision is made, this section can be shortened, but for now it
>> is intentionally more detailed so that everyone can see the arguments that
>> were discussed and so that those who could not participate synchronously
>> can still follow and provide feedback offline.
>>
>> In short, we are currently *leaning toward storing index metadata in its
>> own catalog*, while allowing REST catalogs to expose a composite
>> endpoint that returns both table and index metadata in a single round trip.
>> This is similar in spirit to the universal load endpoint discussed in the
>> context of materialized view loading.
>>
>> Thanks,
>> Peter
>>
>> Péter Váry <[email protected]> wrote (on Thu, Feb 19, 2026, 14:06):
>>
>>> Thanks Huaxin for posting the recording and the meeting notes.
>>>
>>> I used this time to also address the questions collected during the sync:
>>>
>>>    - Collected some representative use cases. See the example use-cases
>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>>>    paragraph. Anyone should feel free to suggest their own.
>>>    - Collected my thoughts about the writer requirements. See the writer
>>>    requirements
>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>>>    paragraph.
>>>    - Centralized the index maintenance related parts. See the index
>>>    maintenance
>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>>>    paragraph.
>>>
>>> It might be a bit premature, but I created a PR
>>> <https://github.com/apache/iceberg/pull/15101> with the proposed index
>>> catalog related changes, so that those who are more code-oriented can take
>>> a look at it too.
>>>
>>> huaxin gao <[email protected]> wrote (on Thu, Feb 19, 2026, 5:34):
>>>
>>>> Hi Everyone,
>>>>
>>>> Here are the recording and notes from the Iceberg Index Support Sync on
>>>> 2/11.
>>>>
>>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>
>>>> Notes:
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>
>>>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>>
>>>> Since the sync, I updated the Bloom skipping index proposal
>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>> to address the discussion questions, specifically:
>>>>
>>>>
>>>>    - Performance justification: when this helps (high-cardinality = /
>>>>    IN, many data files, high object-store latency) and how it differs from
>>>>    Parquet row-group Bloom filters (which still require opening the data
>>>>    file).
>>>>    - Cost / scalability: rough sizing (Bloom blob size per file,
>>>>    Puffin file size), the planning cost trade-off (driver index reads vs
>>>>    executor file opens), and mitigations via caching.
>>>>    - Lifecycle / maintenance: incremental production as new data files
>>>>    arrive, behavior when the index is missing/behind, and
>>>>    sharding/compaction plus cleanup to avoid accumulating too many small
>>>>    Puffin files over time.
>>>>    - Writer expectations: inline (optional) vs asynchronous (primary)
>>>>    index creation.
>>>>
>>>> I also implemented a Spark 4.1 POC
>>>> <https://github.com/apache/iceberg/pull/15311> and a local benchmark
>>>> to quantify both the pruning impact (plannedFiles → afterBloom) and the
>>>> index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
>>>> predicates on high-cardinality columns. Please take a look and let me know
>>>> if you have any questions or feedback.
>>>>
>>>> Thanks,
>>>>
>>>> Huaxin
>>>>
>>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]>
>>>> wrote:
>>>>
>>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>
>>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>>> Design doc:
>>>>>
>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>
>>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>
>>>>> Thanks,
>>>>> Huaxin
>>>>>
>>>>>
>>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to meeting
>>>>>> you all next week!
>>>>>>
>>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> We set up the dev calendar event with a new google meet link. Please
>>>>>>> ignore the link from Huaxin's original email.
>>>>>>>
>>>>>>> The dev calendar has the correct info (including the new meeting link).
>>>>>>>
>>>>>>> Iceberg Index Support Sync
>>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>> Time zone: America/Los_Angeles
>>>>>>> Google Meet joining info
>>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>
>>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>>> Looking forward to the discussion!
>>>>>>>>
>>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Huaxin,
>>>>>>>>>
>>>>>>>>> Thanks for starting the sync!
>>>>>>>>>
>>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>>> not EST. Maybe it's a typo?
>>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>>>> support. Here is the existing discussion thread:
>>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>>>>
>>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>>
>>>>>>>>>>    - Peter's proposal
>>>>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>>>    (overall index support)
>>>>>>>>>>    - My proposal
>>>>>>>>>>    <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>>>    (bloom filter skipping index)
>>>>>>>>>>
>>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>>>> starting next Wednesday (2/11). After FileFormat sync finishes, we
>>>>>>>>>> plan to use that slot and switch to every other Monday, 9 AM to
>>>>>>>>>> 10 AM EST.
>>>>>>>>>>
>>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Huaxin
>>>>>>>>>>
>>>>>>>>>
