JingsongLi commented on PR #7933:
URL: https://github.com/apache/paimon/pull/7933#issuecomment-4593427624

   I reviewed the latest change and found two issues that look worth fixing 
before merge:
   
   - **[P1] Null values now stop the whole shard during generic index build.** 
In `GenericIndexTopoBuilder`, both the single-column path and multi-column path 
`break` when an indexed value is null. The Spark multi-column path does the 
same in `DefaultGlobalIndexBuilder`. Existing real writers such as 
Tantivy/Lumina treat null as “advance logical row id but do not add an indexed 
value”, so later non-null rows in the same shard still get indexed. With the 
new `break`, any rows after the first null in that shard are missing from the 
global index, which can make search/filter results incomplete. This should 
probably keep writing through the shard and let the writer preserve row-id 
alignment, or introduce a multi-column writer contract that can skip one 
logical row without ending the shard.
   
   - **[P2] `GlobalIndexScanner` only maps each field id to one multi-column 
group.** The PR allows creating multi-column indexes like `(a,b)` and `(a,c)`, 
but scanner construction stores a single `fieldId -> group` mapping. If the 
same field participates in multiple multi-column indexes, it can either throw 
`Inconsistent extraFieldIds across index files` or overwrite the previous 
group, so predicate pushdown/pre-filtering may miss one set of readers. This 
should be modeled as `fieldId -> multiple groups/readers` and merged with any 
single-column readers for the field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to