krishan1390 opened a new issue, #18663: URL: https://github.com/apache/pinot/issues/18663

## Symptom

A non-committer realtime server received a `CATCH_UP` to offset `…593` and immediately failed:

- Reported `_currentOffset = …291`, asked to catch up to `…593`.
- First poll returned records starting at `…846` → flagged **"Message loss detected"** (`firstOffset …846 > startOffset …291`).
- That overshot the target → **"Past max offset"** → segment went to ERROR.
- ~30 min later, **reingestion** (a fresh consumer starting from the segment's start offset) read the *same range* cleanly up to `…593`. So the broker clearly had the data.

The contradiction: a "message loss" at `…291`, yet a later read from a *much earlier* offset succeeds.

## Hypothesis

Not broker data loss but a consumer-position bug, a regression introduced by #18337:

1. While consuming, the last poll pre-fetched a batch (~555 records, `…291`→`…846`) but the **segment time limit struck mid-fetch**, so `processStreamEvents` bailed at index 0 and zero records were indexed. `_currentOffset` stayed at `…291`; the Kafka consumer's internal `_nextReadOffset` advanced to `…846`.
2. On `CATCH_UP` the same consumer is reused (no recreation / re-seek). The seek-skip logic sees `startOffset (…291) == _lastFetchStartOffset` and **skips the re-seek**, polling from the advanced position `…846` instead of repositioning to `…291`.
3. The broker correctly returns records from `…846`; the `firstOffset > startOffset` check mislabels this as message loss, and the offset overshoots the catch-up target.

The ~555-offset gap being *exactly one un-consumed poll batch* is the tell. Reingestion works because it uses a fresh consumer that seeks cleanly to the start; the records were never missing.

The pre-#18337 logic keyed the seek on the last *record* offset vs. contiguity (`_lastFetchedOffset != startOffset - 1`), which would have re-seeked correctly here.
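
To make the hypothesis concrete, here is a minimal toy simulation of the two seek conditions described above. It is a sketch under stated assumptions, not Pinot's actual code: `FakeConsumer`, `Fetcher`, `SeekPolicy` and `firstOffsetOnCatchUp` are hypothetical names, the post-#18337 rule is reduced to "skip the seek when `startOffset` equals the previous call's `startOffset`" (the real condition may have more branches), and the offsets are only the trailing digits quoted in this report, since the full values are elided.

```java
// Toy model only: FakeConsumer, Fetcher and SeekPolicy are hypothetical names,
// not Pinot or Kafka classes. Offsets are just the trailing digits from the report.
class SeekSkipSketch {

  enum SeekPolicy { PRE_18337, POST_18337 }

  /** Stand-in for a stream consumer that remembers only its next read position. */
  static final class FakeConsumer {
    private long nextReadOffset = 0;

    void seek(long offset) {
      nextReadOffset = offset;
    }

    /** Returns the first offset of the batch and advances past the whole batch. */
    long poll(int batchSize) {
      long first = nextReadOffset;
      nextReadOffset += batchSize;
      return first;
    }
  }

  static final class Fetcher {
    private final FakeConsumer consumer = new FakeConsumer();
    private final SeekPolicy policy;
    private long lastFetchedOffset = -1;     // last record offset returned by the previous fetch
    private long lastFetchStartOffset = -1;  // startOffset passed to the previous fetch

    Fetcher(SeekPolicy policy) {
      this.policy = policy;
    }

    /** Fetches one batch and returns the offset of its first record. */
    long fetch(long startOffset, int batchSize) {
      boolean seek;
      if (policy == SeekPolicy.PRE_18337) {
        // Seek unless startOffset directly follows the last record fetched.
        seek = lastFetchedOffset != startOffset - 1;
      } else {
        // Simplified assumption: skip the seek when the caller repeats its startOffset.
        seek = startOffset != lastFetchStartOffset;
      }
      if (seek) {
        consumer.seek(startOffset);
      }
      long first = consumer.poll(batchSize);
      lastFetchedOffset = first + batchSize - 1;
      lastFetchStartOffset = startOffset;
      return first;
    }
  }

  /** Replays the scenario: one pre-fetched but unindexed batch, then CATCH_UP. */
  static long firstOffsetOnCatchUp(SeekPolicy policy) {
    Fetcher fetcher = new Fetcher(policy);
    long currentOffset = 291;            // stays put because zero records were indexed
    fetcher.fetch(currentOffset, 555);   // batch 291..845 is fetched, then discarded
    return fetcher.fetch(currentOffset, 555); // CATCH_UP reuses the same consumer
  }

  public static void main(String[] args) {
    long catchUpTarget = 593;
    for (SeekPolicy policy : SeekPolicy.values()) {
      long first = firstOffsetOnCatchUp(policy);
      System.out.printf("%s: firstOffset=%d, aheadOfStart=%b, pastTarget=%b%n",
          policy, first, first > 291, first > catchUpTarget);
    }
  }
}
```

In this toy model the contiguity-based rule re-seeks and the catch-up fetch starts at 291, while the start-offset-based rule keeps the advanced position and starts at 846, which is both ahead of the caller's offset and past the 593 target.
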
#18337 switched the key to the previous *call's* startOffset to fix a `read_committed` flake (where re-seeking undid progress past broker-filtered aborted records). However, the "consumer position is ahead of the caller's offset" condition is indistinguishable between (a) broker-filtered aborted records (keep position) and (b) a pre-fetched-but-unconsumed batch (must re-seek), and the new logic always keeps the position.

**To confirm:** the build must include #18337, and the table must be `read_uncommitted` (the only isolation level under which the gap is flagged as loss).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
