Liulietong opened a new issue, #7220:
URL: https://github.com/apache/paimon/issues/7220

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found no similar issues.
   
   ### Paimon version
   
   master (latest)
   
   ### Compute Engine
   
   None
   
   ### Minimal reproduce step
   
   When using `changelog-producer = lookup` with `sequence.field` configured, 
`LookupMergeFunction.pickHighLevel()` may select the wrong "old" record when 
out-of-order data arrives.
   
   **Configuration:**
   ```sql
   CREATE TABLE test (
       id INT PRIMARY KEY NOT ENFORCED,
       value INT,
       update_time BIGINT
   ) WITH (
       'changelog-producer' = 'lookup',
       'sequence.field' = 'update_time'
   );
   ```
   
   **Scenario:**
   ```
   Initial state after compaction:
     L1: (id=1, value=100, update_time=7)
     L2: (id=1, value=200, update_time=8)  ← Actually newer!
   
   New out-of-order data arrives at L0:
     L0: (id=1, value=50, update_time=6)   ← Old data arriving late
   ```
   
   **Expected behavior:**
   - `pickHighLevel()` should select L2 (update_time=8) as the "latest" 
high-level record
   - The result should reflect the record with the highest sequence value
   
   **Actual behavior:**
   - `pickHighLevel()` selects L1 (level 1 < level 2), ignoring `sequence.field`
   - The wrong changelog is generated
   
   ### What doesn't meet your expectations?
   
   `LookupMergeFunction.pickHighLevel()` only compares level numbers, ignoring 
`sequence.field`:
   
   ```java
   // LookupMergeFunction.java:88 - Current behavior
   if (highLevel == null || kv.level() < highLevel.level()) {
       highLevel = kv;  // Always picks lowest level, ignores sequence
   }
   ```
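
To make the effect of the level-only comparison concrete, here is a minimal, self-contained sketch. The `Kv` class and `pickHighLevelByLevelOnly` are hypothetical stand-ins, not Paimon's `KeyValue` or the real method, and skipping level-0 records is an assumption about how high-level candidates are filtered. It only models the comparison quoted above against the scenario's three records.

```java
import java.util.Arrays;
import java.util.List;

public class LevelOnlyPickDemo {

    /** Hypothetical stand-in for a key-value record; only the fields needed here. */
    static final class Kv {
        final int level;
        final long sequence;
        final int value;

        Kv(int level, long sequence, int value) {
            this.level = level;
            this.sequence = sequence;
            this.value = value;
        }
    }

    /** Models the level-only comparison: the lowest level above 0 wins. */
    static Kv pickHighLevelByLevelOnly(List<Kv> candidates) {
        Kv highLevel = null;
        for (Kv kv : candidates) {
            if (kv.level <= 0) {
                continue; // assumption: level-0 records are not high-level candidates
            }
            if (highLevel == null || kv.level < highLevel.level) {
                highLevel = kv; // the sequence value is never consulted
            }
        }
        return highLevel;
    }

    public static void main(String[] args) {
        List<Kv> candidates = Arrays.asList(
                new Kv(0, 6L, 50),    // late-arriving record at L0
                new Kv(1, 7L, 100),   // L1
                new Kv(2, 8L, 200));  // L2, highest sequence
        Kv picked = pickHighLevelByLevelOnly(candidates);
        // prints "picked level=1 sequence=7"
        System.out.println("picked level=" + picked.level + " sequence=" + picked.sequence);
    }
}
```

In this model the record with sequence 8 is never returned, which matches the behavior described in this report.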
   
   **Reproducible scenario:**
   ```java
   // When candidates contain:
   // L1: (key=1, sequence=7)  <- level 1
   // L2: (key=1, sequence=8)  <- level 2, but higher sequence (newer!)
   
   // pickHighLevel() returns L1 (because level 1 < 2)
   // But should return L2 (because sequence 8 > 7)
   ```
   
   It should use the `sequence.field` comparator when one is configured, similar 
to how `SortMergeReaderWithMinHeap` handles it at lines 61-67.
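
One possible shape for a sequence-aware selection is sketched below. This is an illustration under stated assumptions, not the fix from the upcoming PR: `Kv`, `pickHighLevel`, and the nullable `sequenceComparator` parameter are hypothetical names, and breaking sequence ties by preferring the lower level is a design choice made for this sketch only.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SequenceAwarePickDemo {

    /** Hypothetical stand-in for a key-value record; only the fields needed here. */
    static final class Kv {
        final int level;
        final long sequence;
        final int value;

        Kv(int level, long sequence, int value) {
            this.level = level;
            this.sequence = sequence;
            this.value = value;
        }
    }

    /**
     * Picks the high-level record. With a comparator (modelling a configured
     * sequence.field), the highest sequence wins and ties go to the lower level.
     * Without one, falls back to the level-only rule.
     */
    static Kv pickHighLevel(List<Kv> candidates, Comparator<Kv> sequenceComparator) {
        Kv highLevel = null;
        for (Kv kv : candidates) {
            if (kv.level <= 0) {
                continue; // assumption: level-0 records are not high-level candidates
            }
            if (highLevel == null) {
                highLevel = kv;
            } else if (sequenceComparator != null) {
                int cmp = sequenceComparator.compare(kv, highLevel);
                if (cmp > 0 || (cmp == 0 && kv.level < highLevel.level)) {
                    highLevel = kv;
                }
            } else if (kv.level < highLevel.level) {
                highLevel = kv;
            }
        }
        return highLevel;
    }

    public static void main(String[] args) {
        List<Kv> candidates = Arrays.asList(
                new Kv(0, 6L, 50),
                new Kv(1, 7L, 100),
                new Kv(2, 8L, 200));
        Comparator<Kv> bySequence = Comparator.comparingLong(kv -> kv.sequence);

        Kv withSequence = pickHighLevel(candidates, bySequence);
        Kv withoutSequence = pickHighLevel(candidates, null);
        // prints "with comparator: level=2 sequence=8"
        System.out.println("with comparator: level=" + withSequence.level
                + " sequence=" + withSequence.sequence);
        // prints "without comparator: level=1 sequence=7"
        System.out.println("without comparator: level=" + withoutSequence.level
                + " sequence=" + withoutSequence.sequence);
    }
}
```

Keeping the level-only fallback when no comparator is supplied leaves tables without `sequence.field` unchanged in this sketch.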
   
   ### Anything else?
   
   This issue only affects the `changelog-producer = lookup` scenario. Normal 
queries (Batch/Streaming Scan) and Lookup Join are not affected.
   
   I'm working on a fix and will submit a PR shortly. The PR includes a 
complete unit test to reproduce this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.