JingsongLi commented on PR #7933: URL: https://github.com/apache/paimon/pull/7933#issuecomment-4593427624
I reviewed the latest change and found two issues that look worth fixing before merge: - **[P1] Null values now stop the whole shard during generic index build.** In `GenericIndexTopoBuilder`, both the single-column path and multi-column path `break` when an indexed value is null. The Spark multi-column path does the same in `DefaultGlobalIndexBuilder`. Existing real writers such as Tantivy/Lumina treat null as “advance logical row id but do not add an indexed value”, so later non-null rows in the same shard still get indexed. With the new `break`, any rows after the first null in that shard are missing from the global index, which can make search/filter results incomplete. This should probably keep writing through the shard and let the writer preserve row-id alignment, or introduce a multi-column writer contract that can skip one logical row without ending the shard. - **[P2] `GlobalIndexScanner` only maps each field id to one multi-column group.** The PR allows creating multi-column indexes like `(a,b)` and `(a,c)`, but scanner construction stores a single `fieldId -> group` mapping. If the same field participates in multiple multi-column indexes, it can either throw `Inconsistent extraFieldIds across index files` or overwrite the previous group, so predicate pushdown/pre-filtering may miss one set of readers. This should be modeled as `fieldId -> multiple groups/readers` and merged with any single-column readers for the field. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
