wuhainan commented on issue #11013:
URL: https://github.com/apache/seatunnel/issues/11013#issuecomment-4641280555
> That additional observation is useful.
>
> I checked the current reader flow more closely, and I do not see an
intentional handoff that marks the snapshot as finished early in the normal
path. In `INITIAL` mode, the reader only flips `snapshotCompleted` after
`snapshotEvents(...)` returns, and `resolvedTs` is only advanced after the
snapshot scan reaches the end of the current split.
>
> So the `~91,746` ceiling you observed is more likely a symptom of the same
underlying reader bug family rather than a separate "expected switch to
streaming" path. In other words, this still looks like either:
>
> 1. the snapshot phase exited prematurely for some split, or
> 2. the reader reached streaming with an already incomplete materialization
state and then continued advancing `resolvedTs`.
>
> The most useful next evidence, if you still have it, would be:
>
> 1. the startup log lines around `[splitId] Snapshot events start.` and the
first `Capture streaming event from resolvedTs` for the same split
> 2. how many source splits were assigned for that table in the job
>
> That will let us distinguish "snapshot never fully drained" from
"streaming advanced after a partial snapshot" much more cleanly. Either way,
this still looks like a real TiDB-CDC reader bug worth keeping open.
I started a new isolated reproduction job and collected cleaner evidence.
Job:
- Application: `application_1756448821197_5356`
- Job name: `tidb2tidb_deposit_applications`
- SeaTunnel connector: `connector-cdc-tidb-2.3.13`
- TiKV client: `tikv-client-java-3.2.0`
- `startup.mode = initial`
- Flink source parallelism: `1`
The source table row count was:
alpha_online.deposit_applications: 522,625 rows
The target table stopped at:
target table: 93,354 rows
The Flink metrics also stopped at exactly the same number:
Source numRecordsOut = 93354
Sink numRecordsIn = 93354
SourceReceivedQPS = 0.0
SinkWriteQPS = 0.0
Sink numRecordsSendErrors = 0
backPressuredTimeMsPerSecond = 0
idleTimeMsPerSecond = 1000
I queried the same metrics twice and the values stayed at `93354`, so the
job was no longer emitting rows downstream.
Startup log evidence:
2026-06-07 10:46:40,705 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - [alpha_online:deposit_applications:<ByteString@446b0db2 size=19
contents="t\200\000\000\000\000\000e\377_r\000\000\000\000\000\000\000\000">-<ByteString@2828c34c
size=19
contents="t\200\000\000\000\000\000e\377_s\000\000\000\000\000\000\000\000">]
Snapshot events start.
2026-06-07 10:46:41,460 WARN
org.apache.seatunnel.shade.com.zaxxer.hikari.HikariConfig [] - HikariPool-1
- idleTimeout has been set but has no effect because the pool is operating as a
fixed size pool.
2026-06-07 10:46:41,460 INFO
org.apache.seatunnel.shade.com.zaxxer.hikari.HikariDataSource [] - HikariPool-1
- Starting...
2026-06-07 10:46:41,474 INFO
org.apache.seatunnel.shade.com.zaxxer.hikari.HikariDataSource [] - HikariPool-1
- Start completed.
2026-06-07 10:46:59,162 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - Capture streaming event from resolvedTs:466826140257353737
2026-06-07 10:46:59,171 INFO org.tikv.cdc.CDCClient
[] - remove regions: []
2026-06-07 10:46:59,171 INFO org.tikv.cdc.CDCClient
[] - add regions: [{Region[1485688] ConfVer[313901]
Version[12789] Store[177965148]
KeyRange[t\200\000\000\000\000\000e\377_r\200\000\000\000\000\003Z\247]:[t\200\000\000\000\000\000e\377_r\200\000\000\000\000\f\206\233]},
{Region[1749236] ConfVer[313949] Version[12781] Store[177965148]
KeyRange[t\200\000\000\000\000\000e\377_r\200\000\000\000\00029\351]:[t\200\000\000\000\000\000f\001_i\200\000\000\000\000\000\000\002\000\003\200\000\000\000\000\000C\225]},
{Region[164304772] ConfVer[313907] Version[12784] Store[177965147]
KeyRange[t\200\000\000\000\000\000e\377_i\200\000\000\000\000\000\000\004\004\000\000\027\365|\305\270\375\003\200\000\000\000\000\037!\316]:[t\200\000\000\000\000\000e\377_r\200\000\000\000\000\003Z\247]},
{Region[164304978] ConfVer[313919] Version[12781] Store[164454270]
KeyRange[t\200\000\000\000\000\000e\377_r\200\000\000\000\000\f\206\233]:[t\200\000\000\000\000\000e
\377_r\200\000\000\000\000\025\351\000]}, {Region[575192972] ConfVer[313955]
Version[12781] Store[177965147]
KeyRange[t\200\000\000\000\000\000e\377_r\200\000\000\000\000\025\351\000]:[t\200\000\000\000\000\000e\377_r\200\000\000\000\00029\351]}],
timestamp: 466826140257353737
2026-06-07 10:46:59,175 INFO org.tikv.cdc.RegionCDCClient
[] - start streaming region: 1485688, running: true
2026-06-07 10:46:59,181 INFO org.tikv.cdc.RegionCDCClient
[] - start streaming region: 1749236, running: true
2026-06-07 10:46:59,181 INFO org.tikv.cdc.RegionCDCClient
[] - start streaming region: 164304772, running: true
2026-06-07 10:46:59,181 INFO org.tikv.cdc.RegionCDCClient
[] - start streaming region: 164304978, running: true
2026-06-07 10:46:59,181 INFO org.tikv.cdc.RegionCDCClient
[] - start streaming region: 575192972, running: true
2026-06-07 10:46:59,182 INFO org.tikv.cdc.CDCClient
[] - keyRange applied
2026-06-07 10:46:59,282 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - Capture streaming event next resolvedTs:466826140257353737
2026-06-07 10:46:59,284 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - Capture streaming event from resolvedTs:466826140257353737
2026-06-07 10:46:59,384 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - Capture streaming event next resolvedTs:466826140257353737
2026-06-07 10:47:00,385 INFO
org.apache.seatunnel.connectors.seatunnel.cdc.tidb.source.reader.TiDBSourceReader
[] - Capture streaming event from resolvedTs:466826140257353737
2026-06-07 10:47:00,386 INFO org.tikv.cdc.CDCClient
[] - handle resolvedTs: 466826145175175177, regionId: 575192972
So the reader entered streaming about 18 seconds after `Snapshot events
start`, while only about `93,354 / 522,625` rows were materialized downstream.
For the source split count:
- The Flink execution plan shows source parallelism was `1`.
- The TaskManager log contains only one occurrence of `Snapshot events
start`.
- The 5 regions shown in the CDCClient log are TiKV regions watched by
CDCClient for that one split, not SeaTunnel source splits.
After that, `resolvedTs` kept advancing:
2026-06-07 10:56:55,574 INFO TiDBSourceReader - Capture streaming event
next resolvedTs:466826177279950865
2026-06-07 10:56:57,574 INFO TiDBSourceReader - Capture streaming event
next resolvedTs:466826177355972613
Checkpoints also kept succeeding:
2026-06-07 10:56:29,604 INFO CheckpointCoordinator - Completed
checkpoint 20
2026-06-07 10:56:59,591 INFO CheckpointCoordinator - Completed
checkpoint 21
This looks like a reproducible case where the reader entered the streaming
phase before the initial snapshot was fully materialized downstream.
I still cannot prove the exact internal reason from logs alone because there
is no explicit `Snapshot events finished` log or snapshot row count, but the
evidence strongly suggests that the snapshot phase exited prematurely for the
split, then the reader continued advancing `resolvedTs` in streaming mode.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]