steFaiz opened a new issue, #7079: URL: https://github.com/apache/paimon/issues/7079
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Motivation

Currently, Paimon indexes are mainly designed for historical partitioned data. Data Evolution (DE) tables, in turn, are built on top of an append-only table model. As a result, the current index implementation does not yet take deletes or updates into account.

However, DE tables are not strictly append-only: with MERGE INTO, column values can be updated. If a column is updated after a global index has already been built for the affected data, and the index is neither updated accordingly nor has its corresponding entries invalidated or removed, subsequent index-based queries may return incorrect results.

### Solution

## Introduce DeletedRowIds for index

In Paimon, the index is currently only used as a coarse pre-filter, and a full filtering pass still runs afterwards. Based on this, we could simply persist DeletedRanges in the data file and, after the normal index lookup finishes, apply an OR operation between the index result and the DeletedRanges. Rows whose index entries were deleted or invalidated then always remain candidates and are re-checked by the filtering pass. As illustrated below:

<img width="1068" height="768" alt="Image" src="https://github.com/user-attachments/assets/d536ee05-8454-4305-a5e8-709d2931c761" />

The key points are:

1. During MERGE INTO, for any data files being modified, add the corresponding row ranges directly into DeletedRowIds.
2. When an index update/rebuild happens, remove the row ranges covered by that update/rebuild from DeletedRowIds.
3. Accordingly, IndexedSplitScan should be able to decide, based on the input row range, whether to push the row range down to the file format reader or to fall back to a full scan.

## Introduce options for Merge Into

As suggested in this [comment](https://github.com/apache/paimon/pull/7028#pullrequestreview-3655396693), we could add an option for MERGE INTO to control what happens when indexed columns are updated:

1. THROWS_AN_ERROR: perform a partition-level check and fail the commit with an error.
2. DROP_PARTITION_INDEX: in the same commit, drop the index files for all partitions that were modified.
3. DROP_FILE_INDEX: mark the row ranges of the affected data files as deleted/invalidated (as described above).
4. UPDATE_INDEX: add an index-update operator in the downstream MERGE INTO pipeline to update the index incrementally.

### Anything else?

I think we could introduce THROWS_AN_ERROR and DROP_PARTITION_INDEX first to fix the potential data-index inconsistency. The detailed design for partial index deletion and index update can be discussed further.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
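To make the DeletedRowIds proposal above more concrete, here is a minimal Python sketch of the bookkeeping, not Paimon code. It assumes row IDs are tracked as half-open `[start, end)` ranges. The names `DeletedRowIds`, `mark_modified`, `on_index_rebuilt`, `candidates`, and `plan_scan`, as well as the `max_pushdown_fraction` threshold used to choose between push-down and full scan, are hypothetical and only illustrate the three key points.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) range of row IDs (assumption)


def normalize(ranges: List[Range]) -> List[Range]:
    """Sort ranges, drop empty ones, and merge those that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(r for r in ranges if r[0] < r[1]):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a: List[Range], b: List[Range]) -> List[Range]:
    """The OR of two range sets."""
    return normalize(a + b)


def subtract(a: List[Range], b: List[Range]) -> List[Range]:
    """Return the parts of `a` that are not covered by `b`."""
    cuts = normalize(b)
    result: List[Range] = []
    for start, end in normalize(a):
        cursor = start
        for cut_start, cut_end in cuts:
            if cut_end <= cursor or cut_start >= end:
                continue  # this cut does not touch the remaining piece
            if cut_start > cursor:
                result.append((cursor, cut_start))
            cursor = max(cursor, cut_end)
        if cursor < end:
            result.append((cursor, end))
    return result


class DeletedRowIds:
    """Hypothetical holder for row ranges whose index entries may be stale."""

    def __init__(self) -> None:
        self.ranges: List[Range] = []

    def mark_modified(self, file_range: Range) -> None:
        # Key point 1: MERGE INTO touched this file, so invalidate its rows.
        self.ranges = union(self.ranges, [file_range])

    def on_index_rebuilt(self, rebuilt: Range) -> None:
        # Key point 2: the index is fresh again for the rebuilt range.
        self.ranges = subtract(self.ranges, [rebuilt])

    def candidates(self, index_hits: List[Range]) -> List[Range]:
        # OR the index result with the invalidated ranges; the later
        # full filtering pass removes rows that do not actually match.
        return union(index_hits, self.ranges)


def plan_scan(file_range: Range, candidates: List[Range],
              max_pushdown_fraction: float = 0.5):
    """Key point 3: push row ranges down, or fall back to a full scan.

    The fraction threshold is an illustrative assumption, not a Paimon rule.
    """
    # file minus (file minus candidates) is the intersection with the file.
    inside = subtract([file_range], subtract([file_range], candidates))
    covered = sum(end - start for start, end in inside)
    total = file_range[1] - file_range[0]
    if covered > total * max_pushdown_fraction:
        return ("FULL_SCAN", [file_range])
    return ("PUSH_DOWN", inside)
```

In this sketch, a stale index can only widen the candidate set, never narrow it, which is why correctness relies on the full filtering pass that already runs after the index lookup.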
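The four proposed MERGE INTO behaviors could be modeled roughly as below. This is a sketch under stated assumptions: only the four option names come from the issue, while `IndexedColumnUpdateAction`, `CommitPlan`, `plan_merge_into`, and the shape of the `touched` input (partition name plus row range of each modified file) are hypothetical and do not reflect an existing Paimon API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Set, Tuple

Range = Tuple[int, int]  # half-open [start, end) range of row IDs (assumption)


class IndexedColumnUpdateAction(Enum):
    """The four option values proposed in the issue."""
    THROWS_AN_ERROR = auto()
    DROP_PARTITION_INDEX = auto()
    DROP_FILE_INDEX = auto()
    UPDATE_INDEX = auto()


@dataclass
class CommitPlan:
    """Hypothetical summary of index-related work attached to a commit."""
    drop_index_partitions: Set[str] = field(default_factory=set)
    invalidated_ranges: List[Range] = field(default_factory=list)
    run_index_update: bool = False


def plan_merge_into(action: IndexedColumnUpdateAction,
                    touched: List[Tuple[str, Range]]) -> CommitPlan:
    """Decide what to do when MERGE INTO updates indexed columns.

    `touched` lists (partition, row_range) for each modified data file
    that is already covered by a global index.
    """
    plan = CommitPlan()
    if not touched:
        return plan  # no indexed data was modified, nothing to do

    if action is IndexedColumnUpdateAction.THROWS_AN_ERROR:
        partitions = sorted({partition for partition, _ in touched})
        raise ValueError(f"MERGE INTO updates indexed partitions: {partitions}")
    if action is IndexedColumnUpdateAction.DROP_PARTITION_INDEX:
        # Drop index files for every modified partition in the same commit.
        plan.drop_index_partitions = {partition for partition, _ in touched}
    elif action is IndexedColumnUpdateAction.DROP_FILE_INDEX:
        # Only invalidate the row ranges of the affected data files.
        plan.invalidated_ranges = [row_range for _, row_range in touched]
    elif action is IndexedColumnUpdateAction.UPDATE_INDEX:
        # Let a downstream operator update the index incrementally.
        plan.run_index_update = True
    return plan
```

Under these assumptions, THROWS_AN_ERROR and DROP_PARTITION_INDEX need only partition-level information, which is consistent with the suggestion to introduce those two first.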
