yugeeklab opened a new pull request, #8207:
URL: https://github.com/apache/paimon/pull/8207

   ### Purpose
   
   Linked issue: close #8205
   
   `PaimonMicroBatchStream#planInputPartitions` clamped the checkpointed start 
offset up to `initOffset` whenever it compared lower. `initOffset` is 
recomputed from the current table state on every restart, so with scan modes 
like `latest-full` it always points at the current snapshot with 
`scanSnapshot=true`. Any restarted query therefore dropped its valid 
checkpointed position, silently skipped the changelog gap, and re-emitted the 
entire snapshot as `+I` rows.
   
   This PR falls back to `initOffset` only when the checkpointed snapshot has 
actually expired (older than `earliestSnapshotId`); otherwise the query resumes 
from the checkpointed offset as-is. A warning is logged when the fallback is 
taken.
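   
   The change in start-offset resolution can be sketched as follows. This is a 
minimal illustration, not the code from the PR: `StartOffsetSketch`, `Offset`, 
`resolveOld`, and `resolveNew` are hypothetical names, offsets are assumed to 
order by snapshot id alone, and the `System.err` line stands in for the real 
warning log.

```java
// Hypothetical sketch of the start-offset decision described above.
// None of these names come from the Paimon code base.
final class StartOffsetSketch {

    /** Stand-in for a stream offset: a snapshot id plus the scanSnapshot flag. */
    static final class Offset {
        final long snapshotId;
        final boolean scanSnapshot;

        Offset(long snapshotId, boolean scanSnapshot) {
            this.snapshotId = snapshotId;
            this.scanSnapshot = scanSnapshot;
        }

        @Override
        public String toString() {
            return "Offset(" + snapshotId + ", scanSnapshot=" + scanSnapshot + ")";
        }
    }

    /** Old behavior: clamp up to initOffset whenever the checkpoint compares lower. */
    static Offset resolveOld(Offset checkpointed, Offset initOffset) {
        // Assumption: ordering is by snapshot id only.
        return checkpointed.snapshotId < initOffset.snapshotId ? initOffset : checkpointed;
    }

    /** New behavior: fall back only when the checkpointed snapshot has expired. */
    static Offset resolveNew(Offset checkpointed, Offset initOffset, long earliestSnapshotId) {
        if (checkpointed.snapshotId < earliestSnapshotId) {
            // Stand-in for the warning logged when the fallback is taken.
            System.err.println("WARN: checkpointed snapshot " + checkpointed.snapshotId
                    + " expired (earliest is " + earliestSnapshotId + "); using " + initOffset);
            return initOffset;
        }
        return checkpointed;
    }

    public static void main(String[] args) {
        Offset checkpointed = new Offset(10, false); // position saved before the restart
        Offset initOffset = new Offset(15, true);    // recomputed from current table state
        long earliestSnapshotId = 8;                 // snapshot 10 is still retained

        System.out.println("old: " + resolveOld(checkpointed, initOffset));
        System.out.println("new: " + resolveNew(checkpointed, initOffset, earliestSnapshotId));
    }
}
```

   In this sketch the old rule returns `initOffset` and so re-scans the full 
snapshot, while the new rule keeps the checkpointed offset because snapshot 10 
is still retained.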
   
   ### Tests
   
   Verified against a production 5-minute-trigger streaming query (~300k-row 
source): with the fix the first batch after a restart resumes from the 
checkpointed offset and reads only the downtime changelog; without it, restarts 
produced one empty batch followed by a full-snapshot re-emission (offset WAL 
evidence in #8205).
   
   I did not find an existing unit-test harness for simulating restarts of 
`PaimonMicroBatchStream`; happy to add one if maintainers can point to a 
preferred pattern.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

