The correct link to the spreadsheet is: https://docs.google.com/spreadsheets/d/14cBdwsOw89ivolHtAw342YNoGmb1-Kri1E80hwWymL0
On Tue, Nov 18, 2025 at 12:32, Péter Váry <[email protected]> wrote:

> Hi Team,
>
> Do we have any progress on this topic? I’d really like to see this move forward.
>
> Following Sreeram’s suggestion, we should start collecting the key use cases we want to support with indexes. Here’s what I’ve heard so far:
>
> - *Primary key index*
>    - Find a single or few rows by a given primary key
>    - Build the Flink “primary key → file_name, position” state by bulk reading the primary key index
> - *Secondary index*
>    - Range or min/max filtering on columns that are not part of the primary key (primary sort order)
> - *Full-text index*
>    - Term search in text columns
> - *Vector index*
>    - Nearest or approximate nearest neighbor search
> - *Geospatial index*
>    - Finding points within a polygon or nearest location
>
> We should identify a few critical use cases and keep the others in mind when designing how we store, retrieve, and use these indexes. Personally, I’d love to see *vector indexes in Iceberg*, enabling fast AI searches on Iceberg tables.
>
> For reference, I asked Copilot to collect the currently available index types in MSSQL, Oracle, Postgres, MySQL, and LanceDB. Here’s the list:
> https://docs.google.com/spreadsheets/d/14cBdwsOw89ivolHtAw342YNoGmb1-Kri1E80hwWymL0Thanks
> ,
>
> Peter
>
> On Sun, Nov 2, 2025 at 4:11, Aihua Xu <[email protected]> wrote:
>
>> Thanks Steven for raising this topic and giving a summary on the proposals. I would like to get involved in this area.
>>
>> On Fri, Oct 31, 2025 at 4:49 PM huaxin gao <[email protected]> wrote:
>>
>>> Thanks, Steven, for taking the initiative. I have previously collaborated with Miao from Adobe on secondary index and would like to continue that work.
>>>
>>> Huaxin
>>>
>>> On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]> wrote:
>>>
>>>> Thanks Steven for proposing this! This is the right direction to go.
>>>> Definitely we see challenges in some cases without indexing support, especially around equality deletes and point lookups. I would like to contribute as well. One thing we need to be careful about is the overhead of the index itself, such as memory usage and index updates.
>>>>
>>>> Namratha, for Parquet column index, we had one for Presto: https://www.youtube.com/watch?v=fr_HdhMEa3s.
>>>>
>>>> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I see the point in the doc:
>>>>>
>>>>> *The primary key index can also be useful for point lookup.*
>>>>>
>>>>> But to achieve the above we would need to store native file format metadata like parquet page index <https://parquet.apache.org/docs/file-format/pageindex/> in the primary index, which helps in fetching for the lookup use case. Have there been any talks in the community about this? Would like to get more opinions on this.
>>>>>
>>>>> Thanks,
>>>>> Namratha
>>>>>
>>>>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <[email protected]> wrote:
>>>>>
>>>>>> Thanks Steven,
>>>>>> +1 on this initiative, I am also interested to contribute in this area.
>>>>>> As you mentioned it has quite a breadth, my thought is we can start a document to discuss different layers separately like type of indexes, sync vs async, spec changes, priority of the index to be supported (instead of targeting all in one go)
>>>>>>
>>>>>> Thanks,
>>>>>> Manish
>>>>>>
>>>>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of detail yet.
>>>>>>>
>>>>>>> In some cases, the index files are expected to be added along with the data files in the same commit. Maybe some cases (like secondary index) would prefer an async process.
>>>>>>>
>>>>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]> wrote:
>>>>>>>
>>>>>>>> Are the index files for all kinds expected to be written and added along with data files or would it be an optional async step?
>>>>>>>>
>>>>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> > *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>>>>
>>>>>>>>> Here is my mental model:
>>>>>>>>> - Primary Key - the unique identifier for the rows
>>>>>>>>> - Primary Key index - database index constructed on the Primary Key column
>>>>>>>>> - Iceberg sort order - performance optimization used to speed up frequent, or costly queries.
>>>>>>>>>
>>>>>>>>> The Iceberg sort order is often defined on different columns than the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>>>>
>>>>>>>>> > we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits.
>>>>>>>>>
>>>>>>>>> Could you please elaborate on what "Iceberg Table based Store Secondary Index" means?
>>>>>>>>> Is this another Iceberg table with different columns and different sort order?
>>>>>>>>>
>>>>>>>>> > they want it to be in an open format, so that it can be shared with other engines!
>>>>>>>>>
>>>>>>>>> Wholeheartedly agreed!
>>>>>>>>>
>>>>>>>>> Thanks Steven for starting, and others for participating in the discussion!
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On Tue, Jul 15, 2025 at 22:12, Sreeram Garlapati <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Steven for starting this.
>>>>>>>>>>
>>>>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>>>>
>>>>>>>>>> Here are some preliminary thoughts:
>>>>>>>>>>
>>>>>>>>>> 1. *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>>>>> 2. *Secondary Index*: Secondary Index storage calls for an efficient organization which can hold Secondary Keys along with the Location of the Row and any included columns. The index can be of many types, based on the Data. Iceberg tables are typically very, very large. Hence, these Indexes also tend to be very large. Based on our past 1-2 years of work in this space, we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits. This decision was also shaped by popular opinion from many of our partners & customers - as the Index computation involves a lot of computation, they want it to be in an open format, so that it can be shared with other engines!
>>>>>>>>>> 3. *Others: Full Text Search Indexes and Vector Indexes*: It is critical that we allow years of innovation in the space of Full Text Search and Vector indexes, especially with the current acceleration in AI adoption & the need it is driving on the Keyword and Similarity Search space. Given that Iceberg tables are extremely large, it is critical for us to provide a good story for Indexes that can be incrementally updated / partially loaded into memory.
>>>>>>>>>>
>>>>>>>>>> Looking forward to the discussions.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Sreeram
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>>>>
>>>>>>>>>>> I have been interested in secondary indexing in Iceberg. There was an old proposal for secondary indexing [1]; we may need to revisit/redesign these structures. I agree this is a very broad topic and having indexing structures general enough to support a wide range of use-cases will be a key challenge.
>>>>>>>>>>>
>>>>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>>>>
>>>>>>>>>>> [1] - https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anurag Mantripragada
>>>>>>>>>>>
>>>>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks Steven for the summary. It would be great to extend the Iceberg spec with index files, such that they can be used for the different use cases.
>>>>>>>>>>>
>>>>>>>>>>> For my understanding, let me further outline the different types of use cases for index files:
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> In their current form, equality deletes make it impossible to achieve proper merge-on-read performance in streaming reads, and they also add a significant performance overhead in batch pipelines.
>>>>>>>>>>>
>>>>>>>>>>> Approach (a): https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>>> Converting equality deletes to positional deletes would be a great achievement. I'm wondering, though, if all engines will be able to achieve this. There is quite some runtime complexity involved. If I understand correctly, the index can be bootstrapped via table maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>>>>>
>>>>>>>>>>> Approach (b): https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>>>>> This would boost the resolution of equality deletes during reads via indices. The indices can be built via maintenance tasks, or directly by the writer as in (a). But how to keep the index fresh if we don't write the index at the writers? Readers won't always be able to use an up-to-date index, making this less suitable for streaming reads.
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>>>>> Adding full-text search would broaden Iceberg’s applicability, enabling new search use cases and making table scans far more powerful.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge interest in adding index support in Iceberg V4 and gather a focus group in this area.
>>>>>>>>>>>>
>>>>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>>>>
>>>>>>>>>>>> - Peter Vary and I are working on a proposal (WIP) to only write position deletes in the Flink streaming writer. It would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>>>>> - Xiaoxuan Li has a proposal to leverage index files to improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>>>> - pengzhiwei has a proposal to support full-text index and vector index. [3]
>>>>>>>>>>>>
>>>>>>>>>>>> *Idea: index files*
>>>>>>>>>>>>
>>>>>>>>>>>> To support those use cases, Iceberg can add support for index files (in addition to data files and delete files). It should be general enough to support different forms of indexing.
>>>>>>>>>>>>
>>>>>>>>>>>> - Primary key index
>>>>>>>>>>>> - Secondary index
>>>>>>>>>>>> - Full text index
>>>>>>>>>>>> - Vector index
>>>>>>>>>>>>
>>>>>>>>>>>> This email is a starting point. It is a large topic. A lot of discussions and maturation of the ideas are needed before a formal proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/ (WIP)
>>>>>>>>>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>>>>>
>>>>
>>>> --
>>>> Xinli Shang
