[
https://issues.apache.org/jira/browse/FLINK-39315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39315:
-----------------------------------
Labels: pull-request-available (was: )
> MySql cdc connector could get stuck in backfill binlog reading after a
> failover within snapshot phase when the MySql table is being continuously
> written
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39315
> URL: https://issues.apache.org/jira/browse/FLINK-39315
> Project: Flink
> Issue Type: Bug
> Components: Flink CDC
> Affects Versions: cdc-3.4.0, cdc-3.5.0
> Reporter: Cong Cheng
> Priority: Major
> Labels: pull-request-available
>
> h3. Summary
> This issue is different from FLINK-39207. Even with FLINK-39207 fixed, the
> MySQL CDC source can still hang during the snapshot backfill phase when
> processing multiple snapshot splits sequentially while reusing the same
> BinaryLogClient .
> When `MySqlSourceReader` processes multiple snapshot splits sequentially
> (reusing the same `BinaryLogClient` across splits), the job can get stuck and
> hang indefinitely in `SnapshotSplitReader.pollWithBuffer()` during the
> snapshot backfill phase, waiting for `BINLOG_END` while the queue remains
> empty.
> h3. Root Cause Analysis
> # `SnapshotSplitReader.pollWithBuffer()` keeps polling
> `ChangeEventQueue.poll()` until it receives the `BINLOG_END` watermark event;
> otherwise it will wait indefinitely.
> # The MySQL CDC implementation reuses a single `BinaryLogClient` instance
> across split executions (via reusing `StatefulTaskContext` /
> `MySqlTaskContextImpl`).
> # `StatefulTaskContext.configure()` rebuilds `ChangeEventQueue` /
> `EventDispatcher` / `SignalEventDispatcher` for each split, so each split has
> a different target queue/dispatcher.
> # In `MySqlStreamingChangeEventSource.execute()`, each execution registers
> multiple `BinaryLogClient` event/lifecycle listeners (e.g. the main event
> listener, lifecycle listener, `onEvent`, debug listener), but the
> implementation does not unregister these listeners when the execution
> finishes (it only disconnects the client).
> # Therefore, listeners from previous splits accumulate and remain active
> when later splits start the backfill binlog reading.
> # During backfill of a later split, binlog events will still trigger
> callbacks of an old split’s listener/task. When the current binlog offset
> advances to a point that satisfies the old split’s stop condition (e.g.
> `currentOffset >= oldEndingOffset`, common under continuous writes), the old
> listener can:
> ## stop the shared `StoppableChangeEventSourceContext` (by calling
> `stopChangeEventSource()`), causing the *current* split’s backfill loop to
> exit prematurely; and/or
> ## dispatch `BINLOG_END` via the old `SignalEventDispatcher` into the old
> `ChangeEventQueue` (because the old listener holds the old dispatcher/queue
> created in its split’s `StatefulTaskContext.configure()`).
> # As a result, the current `pollWithBuffer()` is polling the new queue and
> never receives `BINLOG_END`, while the backfill thread has already stopped
> without surfacing an exception, leading to a deadlock/hang.
> h3. Steps to Reproduce
> # Configure a MySQL CDC Source with scan.incremental.snapshot.chunk.size set
> to a large value to ensure a snapshot split is time consuming to read;
> # Keep continuous writes on the MySql source table so binlog offsets advance;
> # Trigger a TaskManager failover while the job is in the snapshot phase;
> # Observe that the job hangs after processing the first split.
> A regression test is also included in the PR (e.g.,
> SnapshotSplitReaderTest#testMultipleSplitsWithBackfill ). It reproduces the
> hang on the buggy version if the fix code splits are commented out and passes
> with the fix applied.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)