Re: [DISCUSS] V4 - indexing support

Péter Váry Tue, 18 Nov 2025 05:20:58 -0800

The correct link to the spreadsheet is:
https://docs.google.com/spreadsheets/d/14cBdwsOw89ivolHtAw342YNoGmb1-Kri1E80hwWymL0


Péter Váry <[email protected]> wrote (on Tue, Nov 18, 2025, 12:32):

> Hi Team,
>
> Do we have any progress on this topic? I’d really like to see this move
> forward.
>
> Following Sreeram’s suggestion, we should start collecting the key use
> cases we want to support with indexes. Here’s what I’ve heard so far:
>
>    - *Primary key index*
>       - Find a single or few rows by a given primary key
>       - Build the Flink “primary key → file_name, position” state by bulk
>       reading the primary key index
>    - *Secondary index*
>       - Range or min/max filtering on columns that are not part of the
>       primary key (primary sort order)
>    - *Full-text index*
>       - Term search in text columns
>    - *Vector index*
>       - Nearest or approximate nearest neighbor search
>    - *Geospatial index*
>       - Finding points within a polygon or nearest location
>
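
For illustration only, the "primary key → file_name, position" state from
the primary key index use case could be sketched as below. This is a minimal
in-memory sketch, not a proposed format: the `build_pk_index` helper, the
dict-based stand-in for data files, and the file names are all hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical stand-in for data files: file name -> primary keys in row order.
DataFiles = Dict[str, List[Hashable]]
RowLocation = Tuple[str, int]  # (file_name, position)


def build_pk_index(data_files: DataFiles) -> Dict[Hashable, RowLocation]:
    """Bulk-build the "primary key -> (file_name, position)" map."""
    index: Dict[Hashable, RowLocation] = {}
    for file_name, keys in data_files.items():
        for position, key in enumerate(keys):
            # If a key appears more than once, the last occurrence wins.
            index[key] = (file_name, position)
    return index


data_files = {
    "data-001.parquet": [101, 102, 103],
    "data-002.parquet": [204, 205],
}
pk_index = build_pk_index(data_files)

# Point lookup: find a single row by its primary key.
location = pk_index.get(205)
```

A real index would live in files rather than in memory, but the lookup and
the bulk read would serve the same two purposes listed above.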
> We should identify a few critical use cases and keep the others in mind
> when designing how we store, retrieve, and use these indexes. Personally,
> I’d love to see *vector indexes in Iceberg*, enabling fast AI searches on
> Iceberg tables.
>
> For reference, I asked Copilot to collect the currently available index
> types in MSSQL, Oracle, Postgres, MySQL, and LanceDB. Here’s the list:
> https://docs.google.com/spreadsheets/d/14cBdwsOw89ivolHtAw342YNoGmb1-Kri1E80hwWymL0
>
> Thanks,
>
> Peter
>
>
> Aihua Xu <[email protected]> wrote (on Sun, Nov 2, 2025, 4:11):
>
>> Thanks Steven for raising this topic and giving a summary on the
>> proposals. I would like to get involved in this area.
>>
>> On Fri, Oct 31, 2025 at 4:49 PM huaxin gao <[email protected]>
>> wrote:
>>
>>> Thanks, Steven, for taking the initiative. I have previously
>>> collaborated with Miao from Adobe on secondary index and would like to
>>> continue that work.
>>>
>>> Huaxin
>>>
>>> On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]>
>>> wrote:
>>>
>>>> Thanks Steven for proposing this! This is the right direction to go.
>>>> We definitely see challenges in some cases without indexing support,
>>>> especially around equality deletes and point lookups. I would like to
>>>> contribute as well. One thing we need to be careful about is the
>>>> overhead of the index itself, such as memory usage and index updates.
>>>>
>>>> Namratha, for Parquet column index, we had one for Presto
>>>> https://www.youtube.com/watch?v=fr_HdhMEa3s.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I see this point in the doc:
>>>>>
>>>>> *The primary key index can also be useful for point lookup.*
>>>>> But to achieve the above, we would need to store native file format
>>>>> metadata like the parquet page index
>>>>> <https://parquet.apache.org/docs/file-format/pageindex/> in the
>>>>> primary index, which helps with fetching for the lookup use case. Have
>>>>> there been any discussions in the community about this? I would like
>>>>> to get more opinions on this.
>>>>>
>>>>> Thanks,
>>>>> Namratha
>>>>>
>>>>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Steven,
>>>>>> +1 on this initiative. I am also interested in contributing in this
>>>>>> area.
>>>>>> As you mentioned, it has quite a breadth. My thought is that we can
>>>>>> start a document to discuss the different layers separately, like
>>>>>> types of indexes, sync vs async, spec changes, and the priority of
>>>>>> the indexes to be supported (instead of targeting all in one go).
>>>>>>
>>>>>> Thanks,
>>>>>> Manish
>>>>>>
>>>>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of
>>>>>>> detail yet.
>>>>>>>
>>>>>>> In some cases, the index files are expected to be added along with
>>>>>>> the data files in the same commit. Some cases (like secondary index)
>>>>>>> may prefer an async process.
>>>>>>>
>>>>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are the index files for all kinds expected to be written and added
>>>>>>>> along with data files or would it be an optional async step?
>>>>>>>>
>>>>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> > *Primary Index*: Conventionally Primary Index - just means what
>>>>>>>>> the Table's Primary storage layout/organization was. Given that
>>>>>>>>> Iceberg supports Sort-order - if the Spec adds constraints to
>>>>>>>>> derive/influence Sort order based on the Identifier columns - it
>>>>>>>>> satisfies the Primary Index criteria.
>>>>>>>>>
>>>>>>>>> Here is my mental model:
>>>>>>>>> - Primary Key - the unique identifier for the rows
>>>>>>>>> - Primary Key index - database index constructed on the Primary
>>>>>>>>> Key column
>>>>>>>>> - Iceberg sort order - performance optimization used to speed up
>>>>>>>>> frequent, or costly queries.
>>>>>>>>>
>>>>>>>>> The Iceberg sort order is often defined on different columns
>>>>>>>>> than the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>>>>
>>>>>>>>> > we found that an Iceberg Table based Store Secondary Index -
>>>>>>>>> provides the right balance between the ability to skip over and
>>>>>>>>> load needed sections and yet provide the right performance benefits.
>>>>>>>>>
>>>>>>>>> Could you please elaborate on what "Iceberg Table based Store
>>>>>>>>> Secondary Index" means?
>>>>>>>>> Is this another Iceberg table with different columns and different
>>>>>>>>> sort order?
>>>>>>>>>
>>>>>>>>> > they want it to be in an open format, so that it can be shared
>>>>>>>>> with other engines!
>>>>>>>>>
>>>>>>>>> Wholeheartedly agreed!
>>>>>>>>>
>>>>>>>>> Thanks Steven for starting, and others for participating in the
>>>>>>>>> discussion!
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> Sreeram Garlapati <[email protected]> wrote (on Tue,
>>>>>>>>> Jul 15, 2025, 22:12):
>>>>>>>>>
>>>>>>>>>> Thanks Steven for starting this.
>>>>>>>>>>
>>>>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>>>>
>>>>>>>>>> Here are some preliminary thoughts:
>>>>>>>>>>
>>>>>>>>>>    1. *Primary Index*: Conventionally Primary Index - just means
>>>>>>>>>>    what the Table's Primary storage layout/organization was. Given
>>>>>>>>>>    that Iceberg supports Sort-order - if the Spec adds constraints
>>>>>>>>>>    to derive/influence Sort order based on the Identifier columns -
>>>>>>>>>>    it satisfies the Primary Index criteria.
>>>>>>>>>>    2. *Secondary Index*: Secondary Index storage calls for an
>>>>>>>>>>    efficient organization which can hold Secondary Keys along with
>>>>>>>>>>    the Location of the Row and any included columns. The index can
>>>>>>>>>>    be of many types, based on the Data. Iceberg tables are
>>>>>>>>>>    typically very, very large. Hence, these Indexes also tend to be
>>>>>>>>>>    very large. Based on our past 1-2 years of work in this space,
>>>>>>>>>>    we found that an Iceberg Table based Store Secondary Index -
>>>>>>>>>>    provides the right balance between the ability to skip over and
>>>>>>>>>>    load needed sections and yet provide the right performance
>>>>>>>>>>    benefits. This decision was also shaped by popular opinion from
>>>>>>>>>>    many of our partners & customers - as building the Index
>>>>>>>>>>    involves a lot of computation, they want it to be in an open
>>>>>>>>>>    format, so that it can be shared with other engines!
>>>>>>>>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It
>>>>>>>>>>    is critical that we allow years of innovation in the space of
>>>>>>>>>>    Full Text Search and Vector indexes, especially with the current
>>>>>>>>>>    acceleration in AI adoption & the need it is driving in the
>>>>>>>>>>    Keyword and Similarity Search space. Given that Iceberg tables
>>>>>>>>>>    are extremely large, it is critical for us to provide a good
>>>>>>>>>>    story for Indexes that can be incrementally updated / partially
>>>>>>>>>>    loaded into memory.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking forward to the discussions.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Sreeram
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>>>>
>>>>>>>>>>> I have been interested in secondary indexing in Iceberg. There
>>>>>>>>>>> was an old proposal for secondary indexing [1]; we may need to
>>>>>>>>>>> revisit/redesign these structures. I agree this is a very broad
>>>>>>>>>>> topic, and having indexing structures general enough to support
>>>>>>>>>>> a wide range of use cases will be a key challenge.
>>>>>>>>>>>
>>>>>>>>>>> I would like to get involved in any discussions related to
>>>>>>>>>>> indexing.
>>>>>>>>>>>
>>>>>>>>>>> [1] -
>>>>>>>>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anurag Mantripragada
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks Steven for the summary. It would be great to extend the
>>>>>>>>>>> Iceberg spec with index files, such that they can be used for
>>>>>>>>>>> the different use cases.
>>>>>>>>>>>
>>>>>>>>>>> For my understanding, let me further outline the different types
>>>>>>>>>>> of use cases for index files:
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> In its current form, equality deletes make it impossible to
>>>>>>>>>>> achieve proper merge-on-read performance in streaming reads, and
>>>>>>>>>>> they also add a significant performance overhead in batch
>>>>>>>>>>> pipelines.
>>>>>>>>>>>
>>>>>>>>>>> Approach (a):
>>>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>>> Converting equality deletes to positional deletes would be a
>>>>>>>>>>> great achievement. I'm wondering, though, whether all engines
>>>>>>>>>>> will be able to achieve this. There is quite some runtime
>>>>>>>>>>> complexity involved in achieving this. If I understand
>>>>>>>>>>> correctly, the index can be bootstrapped via table maintenance
>>>>>>>>>>> tasks, then has to be maintained by the streaming writer.
>>>>>>>>>>>
>>>>>>>>>>> Approach (b):
>>>>>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>>>>> This would boost the resolution of equality deletes during reads
>>>>>>>>>>> via indices. The indices can be built via maintenance tasks, or
>>>>>>>>>>> directly by the writer as in (a). But how to keep the index
>>>>>>>>>>> fresh if we don't write the index at the writers? Readers won't
>>>>>>>>>>> always be able to use an up-to-date index, making this less
>>>>>>>>>>> suitable for streaming reads.
>>>>>>>>>>>
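
As a rough illustration of approach (a), resolving equality deletes into
positional deletes with a primary key index could look like the sketch
below. It assumes a hypothetical in-memory "primary key → (file_name,
position)" map; `to_position_deletes` and the file names are made up for
this example and are not part of any Iceberg API.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

RowLocation = Tuple[str, int]  # (file_name, position)


def to_position_deletes(
    pk_index: Dict[Hashable, RowLocation],
    deleted_keys: Iterable[Hashable],
) -> List[RowLocation]:
    """Turn deletes expressed as primary keys into (file, position) deletes."""
    positions: List[RowLocation] = []
    for key in deleted_keys:
        # Remove the key so the index keeps reflecting only live rows.
        location = pk_index.pop(key, None)
        if location is not None:  # keys not in the index have nothing to delete
            positions.append(location)
    return sorted(positions)


pk_index = {
    101: ("a.parquet", 0),
    102: ("a.parquet", 1),
    204: ("b.parquet", 0),
}
position_deletes = to_position_deletes(pk_index, [204, 999, 101])
```

The sketch also shows why the writer has to maintain the index: every delete
both reads from it and updates it.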
>>>>>>>>>>> ---
>>>>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>>>>> Adding full-text search would broaden Iceberg’s applicability,
>>>>>>>>>>> enabling new search use cases and making table scans far more
>>>>>>>>>>> powerful.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge
>>>>>>>>>>>> interest in adding index support in Iceberg V4 and gather a
>>>>>>>>>>>> focus group in this area.
>>>>>>>>>>>>
>>>>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>>>>
>>>>>>>>>>>>    - Peter Vary and I are working on a proposal (WIP) to
>>>>>>>>>>>>    only write position deletes in the Flink streaming writer. It
>>>>>>>>>>>>    would need a primary key index to work reasonably
>>>>>>>>>>>>    efficiently. [1]
>>>>>>>>>>>>    - Xiaoxuan Li has a proposal to leverage index files to
>>>>>>>>>>>>    improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>>>>    - pengzhiwei has a proposal to support full-text index and
>>>>>>>>>>>>    vector index. [3]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Idea: index files*
>>>>>>>>>>>>
>>>>>>>>>>>> To support those use cases, Iceberg can add support for index
>>>>>>>>>>>> files (in addition to data files and delete files). It should
>>>>>>>>>>>> be general enough to support different forms of indexing.
>>>>>>>>>>>>
>>>>>>>>>>>>    - Primary key index
>>>>>>>>>>>>    - Secondary index
>>>>>>>>>>>>    - Full text index
>>>>>>>>>>>>    - Vector index
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This email is a starting point. It is a large topic. A lot of
>>>>>>>>>>>> discussions and maturation of the ideas are needed before a
>>>>>>>>>>>> formal proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>>>> (WIP)
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>
>>>> --
>>>> Xinli Shang
>>>>
>>>
