yugeeklab opened a new pull request, #8206:
URL: https://github.com/apache/paimon/pull/8206

   ### Purpose
   
   Linked issue: close #8204
   
   With deletion vectors enabled, delete records are dropped from compaction 
output at any non-zero output level, so the deletion of a key lives only in the 
deletion vector of the file holding the old row. `LookupLevels` caches lookup 
files per data file name and freezes the deletion state as of build time. Data 
files are immutable but their deletion vectors are not, so a cached lookup 
file can keep serving a row that has since been marked deleted.
   
   Such a stale hit corrupts every consumer of the lookup. In particular, the 
lookup changelog producer uses it as the changelog BEFORE image: a re-insert 
with content identical to the pre-delete row (modulo 
`changelog-producer.row-deduplicate-ignore-fields`) is judged "no change" and 
produces no changelog, although a `-D` was already emitted by an earlier 
compaction. Downstream CDC consumers end up permanently diverged: the table 
holds a live row while the changelog stream says it was deleted.
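
The divergence described above can be illustrated with a minimal sketch. It is 
not Paimon's changelog producer: `ChangelogDiffSketch` and `changelogFor` are 
hypothetical names, rows are modelled as plain strings, and the `+I` / `-U` / 
`+U` labels are assumed only to show how the BEFORE image drives the decision.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not a Paimon class.
final class ChangelogDiffSketch {

    /**
     * Derives changelog entries from the BEFORE image returned by the lookup
     * and the newly written row. A null BEFORE image means "key is absent".
     */
    static List<String> changelogFor(String before, String after) {
        List<String> out = new ArrayList<>();
        if (before == null) {
            // No previous version visible: the write is an insert.
            out.add("+I " + after);
        } else if (!before.equals(after)) {
            // Previous version differs: emit an update pair.
            out.add("-U " + before);
            out.add("+U " + after);
        }
        // Identical BEFORE image: judged "no change", nothing is emitted.
        return out;
    }
}
```

With a stale hit, `changelogFor("v1", "v1")` yields an empty list even though 
a `-D` for `v1` was already emitted; with the deleted row correctly reported 
as absent, `changelogFor(null, "v1")` yields `+I v1`.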
   
   This PR validates the hit's position against the current deletion vector 
before returning it from `LookupLevels.lookup`. A deleted hit means the newest 
version of the key in the searched levels is gone; deeper levels only hold 
older versions, so the key is reported as absent rather than continuing the 
search. Hits without position information (value-only processors) keep the 
previous behaviour.
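
The check described above can be sketched roughly as follows. This is a 
simplified model, not the actual `LookupLevels` code: `DvAwareLookupSketch`, 
`Hit`, `cache` and `markDeleted` are hypothetical names, and the deletion 
vector is assumed to be a plain set of row positions per file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins; none of these are Paimon classes.
final class DvAwareLookupSketch {

    /** A cached hit: the value plus, optionally, its row position in the data file. */
    static final class Hit {
        final String fileName;
        final String value;
        final Long position; // null for value-only processors

        Hit(String fileName, String value, Long position) {
            this.fileName = fileName;
            this.value = value;
            this.position = position;
        }
    }

    // Lookup cache: built once and never rebuilt in this sketch.
    private final Map<String, Hit> cacheByKey = new HashMap<>();
    // Current deletion vectors: file name -> deleted positions (mutable).
    private final Map<String, Set<Long>> deletionVectors = new HashMap<>();

    void cache(String key, Hit hit) {
        cacheByKey.put(key, hit);
    }

    void markDeleted(String fileName, long position) {
        deletionVectors.computeIfAbsent(fileName, f -> new HashSet<>()).add(position);
    }

    /** Returns the value for the key, or null if it is absent or deleted per the current DV. */
    String lookup(String key) {
        Hit hit = cacheByKey.get(key);
        if (hit == null) {
            return null;
        }
        if (hit.position == null) {
            // No position information: keep the previous behaviour.
            return hit.value;
        }
        Set<Long> deleted = deletionVectors.get(hit.fileName);
        if (deleted != null && deleted.contains(hit.position)) {
            // Newest version is deleted; deeper levels only hold older versions.
            return null;
        }
        return hit.value;
    }
}
```

In this model, deleting an unrelated position in the same file leaves the 
lookup result unchanged, while deleting the hit's own position makes the same 
cached lookup return null without rebuilding the cache.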
   
   ### Tests
   
   `LookupLevelsTest#testLookupRespectsDeletionVectorUpdates` exercises the 
real lookup-file cache:
   
   1. control: lookup returns the live row and warms the cache,
   2. a deletion of an unrelated position in the same file does not affect the 
live row,
   3. after marking the returned position deleted (cache not rebuilt), the same 
lookup returns null.
   
   The test fails without the fix and passes with it. The full paimon-core 
suite reports 0 failures; the remaining errors are environmental 
(Docker-dependent tests and a pre-existing JDK/Hadoop `Subject.getSubject` 
incompatibility) and are identical on master.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

