[
https://issues.apache.org/jira/browse/FLINK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062645#comment-18062645
]
Yanquan Lv commented on FLINK-39207:
------------------------------------
Assigned.
> MySql cdc connector could get stuck in backfill binlog reading after a
> failover within snapshot phase
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-39207
> URL: https://issues.apache.org/jira/browse/FLINK-39207
> Project: Flink
> Issue Type: Bug
> Components: Flink CDC
> Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1, cdc-3.3.0, cdc-3.2.1,
> cdc-3.4.0, cdc-3.5.0
> Reporter: Cong Cheng
> Assignee: Cong Cheng
> Priority: Major
>
> h2. Summary
> When a `MySqlSourceReader` processes multiple snapshot splits sequentially
> using the same `SnapshotSplitReader` instance (typically after a failover in
> snapshot phase), the reader gets stuck and hangs indefinitely (stops emitting
> records).
> h2. Root Cause Analysis
> # When a snapshot split is large or network conditions are poor, the split
> takes longer to finish. While the current snapshot split is being processed,
> a task failover (caused by a machine going down, a lost connection to the
> server, etc.) can happen. The task then tries to recover with the unfinished
> snapshot split;
> # After the recovery, the `MySqlSourceReader` sends another split request
> because the number of currently assigned splits is less than 1 (this
> behaviour was introduced by [this pull
> request|https://github.com/apache/flink-cdc/pull/1927] ), so the
> `MySqlSourceReader` can end up holding two snapshot splits;
> # When the first split finishes, `stopCurrentTask()` is called, which
> invokes `changeEventSourceContext.stopChangeEventSource()` . This sets the
> internal isRunning flag of the context to false;
> # When `submitSplit()` is called for the next split, it reuses the same
> SnapshotSplitReader instance and the same changeEventSourceContext. However,
> the submitSplit() method fails to reset the isRunning flag of the context
> back to true;
> # When the backfill task (a MySqlBinlogSplitReadTask) for the new snapshot
> split starts, it checks context.isRunning(), finds it to be false, and exits
> immediately without reading any data or sending the BINLOG_END event;
> # The main thread (pollWithBuffer) waits indefinitely for data or the end
> signal from the queue, leading to a deadlock.
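
The sequence in the root cause analysis can be sketched in plain Java, independent of the connector code. Everything below is a hypothetical stand-in: `SketchContext`, `SketchSplitReader`, the `reset()` method and the `resetOnSubmit` switch are illustrative names, not the real Flink CDC classes, and resetting the flag in `submitSplit()` is only an assumed fix inferred from the description above. The poll uses a timeout so the sketch terminates, whereas the real reader waits indefinitely.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackfillHangSketch {
    static final String BINLOG_END = "BINLOG_END";

    /** Hypothetical stand-in for the shared change event source context. */
    static final class SketchContext {
        private volatile boolean running = true;
        boolean isRunning() { return running; }
        void stopChangeEventSource() { running = false; }
        void reset() { running = true; } // assumed fix, not the real API
    }

    /** Hypothetical stand-in for a reader instance reused across splits. */
    static final class SketchSplitReader {
        private final SketchContext context = new SketchContext();
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final boolean resetOnSubmit;

        SketchSplitReader(boolean resetOnSubmit) { this.resetOnSubmit = resetOnSubmit; }

        void submitSplit(String splitId) {
            if (resetOnSubmit) {
                context.reset(); // re-arm the shared flag for the new split
            }
            // Backfill task: exits silently if the shared context was stopped.
            Thread backfill = new Thread(() -> {
                if (!context.isRunning()) {
                    return; // no records and no BINLOG_END are emitted
                }
                queue.add(splitId + ":record");
                queue.add(BINLOG_END);
            });
            backfill.start();
        }

        void stopCurrentTask() { context.stopChangeEventSource(); }

        /** Returns true if BINLOG_END arrives before the timeout expires. */
        boolean pollUntilEnd(long timeoutMillis) {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            try {
                while (true) {
                    long remaining = deadline - System.nanoTime();
                    if (remaining <= 0) return false;
                    String event = queue.poll(remaining, TimeUnit.NANOSECONDS);
                    if (event == null) return false;
                    if (BINLOG_END.equals(event)) return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Processes two splits on one reader; reports whether the second one ends. */
    static boolean secondSplitCompletes(boolean resetOnSubmit) {
        SketchSplitReader reader = new SketchSplitReader(resetOnSubmit);
        reader.submitSplit("split-0");
        reader.pollUntilEnd(1000);
        reader.stopCurrentTask(); // sets the shared flag to false
        reader.submitSplit("split-1");
        return reader.pollUntilEnd(500);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + secondSplitCompletes(false));
        System.out.println("with reset:    " + secondSplitCompletes(true));
    }
}
```

Without the reset, the second split never yields `BINLOG_END` and the poll times out, which corresponds to the indefinite wait in the last step. With the reset, the second split completes normally.
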
> h2. Steps to Reproduce
> # Configure a MySQL CDC Source with scan.incremental.snapshot.chunk.size set
> to a large value so that reading a snapshot split is time-consuming;
> # Trigger a TaskManager failover while the job is in the snapshot phase;
> # Observe that the job hangs after processing the first split.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)