Thanks for starting this thread, Steven! I have been interested in secondary indexing in Iceberg. There was an old proposal on secondary indexing [1]; we may need to revisit or redesign these structures. I agree this is a very broad topic, and having indexing structures general enough to support a wide range of use cases will be a key challenge.
I would like to get involved in any discussions related to indexing.

[1] - https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0

Thanks,
Anurag Mantripragada

> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
>
> Thanks Steven for the summary. It would be great to extend the Iceberg spec
> with index files, such that they can be used for the different use cases.
>
> For my understanding, let me further outline the different types of use cases
> for index files:
>
> ---
> Topic 1: Accelerating the resolution of equality deletes
> ---
>
> In its current form, equality deletes make it impossible to achieve proper
> merge-on-read performance in streaming reads, and they also add a significant
> performance overhead in batch pipelines.
>
> Approach (a):
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
> Converting equality deletes to positional deletes would be a great
> achievement. I'm wondering though, if all engines will be able to achieve
> this. There is quite some runtime complexity involved to achieve this. If I
> understand correctly, the index can be bootstrapped via table maintenance
> tasks, then has to be maintained by the streaming writer.
>
> Approach (b): https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
> This would boost the resolution of equality deletes during reads via indices.
> The indices can be built via maintenance tasks, or directly by the writer as
> in (a). But how to keep the index fresh if we don't write the index at the
> writers? Readers won't always be able to use an up-to-date index, making this
> less suitable for streaming reads.
>
> ---
> Topic 2: Full text search in table scans
> ---
>
> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
> Adding full-text search would broaden Iceberg’s applicability, enabling new
> search use cases and making table scans far more powerful.
>
> Cheers,
> Max
>
> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>> Similar to other V4 threads, I am starting a thread to gauge interest in
>> adding index support in Iceberg V4 and gather a focus group in this area.
>>
>> There have been a few discussions related to indexing recently.
>> Me and Peter Vary are working on a proposal (WIP) to only write position
>> deletes in the Flink streaming writer. It would need a primary key index to
>> work reasonably efficiently. [1]
>> Xiaoxuan Li has a proposal to leverage index files to improve merge-on-read
>> performance with equality deletes. [2]
>> pengzhiwei has a proposal to support full-text index and vector index. [3]
>>
>> Idea: index files
>>
>> To support those use cases, Iceberg can add support for index files (in
>> addition to data files and delete files). It should be general enough to
>> support different forms of indexing.
>> Primary key index
>> Secondary index
>> Full text index
>> Vector index
>>
>> This email is a starting point. It is a large topic. A lot of discussions
>> and maturation of the ideas are needed before a formal proposal.
>>
>> Thanks,
>> Steven
>>
>> [1]
>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>> (WIP)
>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>> [3] https://github.com/apache/iceberg/issues/12636
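
To make the primary key index idea in the quoted messages a bit more concrete, here is a minimal sketch of how a streaming writer might use such an index to emit position deletes instead of equality deletes. Everything in it is an assumption for illustration: `PrimaryKeyIndex`, `Position`, and the file names are hypothetical, the index lives purely in memory, and none of this reflects the design in the linked proposals or any existing Iceberg API.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Optional


@dataclass(frozen=True)
class Position:
    """Location of one row: the data file path and the row's ordinal in it."""
    data_file: str
    row: int


class PrimaryKeyIndex:
    """Hypothetical in-memory index mapping a primary key to its live row."""

    def __init__(self) -> None:
        self._live: Dict[Hashable, Position] = {}

    def upsert(self, key: Hashable, new_pos: Position) -> Optional[Position]:
        """Register the new row; return the replaced row's position, if any."""
        old = self._live.get(key)
        self._live[key] = new_pos
        return old

    def delete(self, key: Hashable) -> Optional[Position]:
        """Drop the key; return the position of the row to delete, if any."""
        return self._live.pop(key, None)


index = PrimaryKeyIndex()
position_deletes: List[Position] = []

# Insert key 42, then overwrite it, then delete it.
for op, key, pos in [
    ("upsert", 42, Position("data-001.parquet", 0)),
    ("upsert", 42, Position("data-002.parquet", 7)),
    ("delete", 42, None),
]:
    stale = index.upsert(key, pos) if op == "upsert" else index.delete(key)
    if stale is not None:
        # The writer knows the exact row, so no equality delete is needed.
        position_deletes.append(stale)

for d in position_deletes:
    print(d.data_file, d.row)
```

In this sketch the first upsert produces no delete, while the overwrite and the final delete each yield one position delete. A real index would presumably need to be bootstrapped from existing data files and kept durable across writer restarts, which is where the runtime complexity Max mentions would come in.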