wuhainan commented on issue #11013:
URL: https://github.com/apache/seatunnel/issues/11013#issuecomment-4637008473

   > I traced the current TiDB CDC reader path, and this does look like a real 
source-side bug worth tracking, not just a sink-side stall.
   > 
   > In the current implementation, 
`TiDBSourceReader.captureStreamingEvents(...)` always advances the split's 
`resolvedTs` from `cdcClient.getMaxResolvedTs()`, but actual downstream 
emission still depends on the `preWrites` / `commits` matching path and 
`flushRows(resolvedTs)`. That means the symptom you reported is plausible in 
the current code: checkpoints can keep succeeding and `resolvedTs` can keep 
moving forward even while some row changes are no longer materialized 
downstream.
   > 
   > This also looks related to 
[#8815](https://github.com/apache/seatunnel/issues/8815), but it is not identical. 
[#8815](https://github.com/apache/seatunnel/issues/8815) centered on older 
TiDB-CDC-MIGRATE loss reports, while your reproduction is on 2.3.13 + Flink + 
normal TiDB-CDC and already narrows the symptom down considerably.
   > 
   > We should keep this open as a real bug. We have labeled it as `help 
wanted` since the reporter did not opt into a PR. A good next debugging slice 
would be to verify whether the loss happens around the `PREWRITE` / `COMMIT` 
matching path during region movement or client pull gaps, because that is the 
part that can let `resolvedTs` advance independently from emitted rows.
   > 
   > If someone wants to work on a fix, a small first PR with a deterministic 
regression test around this reader path would be very valuable.
   
   Thanks for the confirmation.
   
   I would like to add one more observation: this job was configured with 
`startup.mode = "initial"`, but it looks like the reader switched to streaming 
before the initial snapshot was fully completed.
   
   The source table had about 521,960 rows, while the target table had only 
about 91,746 rows (roughly 17.6% of the source) when downstream emission 
stopped. The Flink source/sink metrics also showed around 91,746 records.
   
   So it seems that only part of the initial snapshot was emitted before the 
reader entered the incremental CDC phase. I am not sure whether this is the 
same root cause as the `resolvedTs` advancing issue, but it may be related.
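
   To make the `resolvedTs` concern concrete for anyone picking this up, here 
is a minimal toy model in Java. It is not SeaTunnel code and I have not checked 
it against the real `TiDBSourceReader`; the class and method names 
(`ResolvedTsSketch`, `onPrewrite`, `onCommit`, `advance`) are hypothetical, and 
only `preWrites`, `commits`, `resolvedTs` and `flushRows` mirror the terms used 
above. It assumes a row is emitted only after its `PREWRITE` has been matched 
by a `COMMIT`, while the watermark advances unconditionally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: NOT the SeaTunnel TiDBSourceReader.
class ResolvedTsSketch {
    // key -> row value seen in a PREWRITE, still waiting for its COMMIT
    final Map<String, String> preWrites = new HashMap<>();
    // commitTs -> matched rows, waiting until resolvedTs reaches commitTs
    final TreeMap<Long, List<String>> commits = new TreeMap<>();
    // rows that actually reached the downstream
    final List<String> emitted = new ArrayList<>();
    long resolvedTs = 0L;

    void onPrewrite(String key, String value) {
        preWrites.put(key, value);
    }

    void onCommit(String key, long commitTs) {
        String value = preWrites.remove(key);
        if (value != null) {
            commits.computeIfAbsent(commitTs, t -> new ArrayList<>()).add(value);
        }
    }

    // Advances the watermark regardless of what is still buffered, then flushes.
    void advance(long maxResolvedTs) {
        resolvedTs = Math.max(resolvedTs, maxResolvedTs);
        flushRows(resolvedTs);
    }

    // Emits only rows whose commitTs is <= ts; unmatched preWrites are ignored.
    void flushRows(long ts) {
        NavigableMap<Long, List<String>> ready = commits.headMap(ts, true);
        for (List<String> rows : ready.values()) {
            emitted.addAll(rows);
        }
        ready.clear(); // clearing the view removes the entries from commits
    }

    public static void main(String[] args) {
        ResolvedTsSketch r = new ResolvedTsSketch();
        r.onPrewrite("k1", "row-1");
        r.onCommit("k1", 10L);
        r.onPrewrite("k2", "row-2"); // the COMMIT for k2 never arrives
        r.advance(20L);
        // prints: resolvedTs=20 emitted=[row-1] stranded=[k2]
        System.out.println("resolvedTs=" + r.resolvedTs
                + " emitted=" + r.emitted
                + " stranded=" + r.preWrites.keySet());
    }
}
```

   In this model the watermark reaches 20 and a checkpoint taken at that point 
would look healthy, yet `row-2` is never emitted. A regression test for the 
real reader could assert the same kind of invariant: `resolvedTs` should not 
pass a transaction whose change is still buffered and unmatched.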

