[ 
https://issues.apache.org/jira/browse/FLINK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062645#comment-18062645
 ] 

Yanquan Lv commented on FLINK-39207:
------------------------------------

Assigned.

> MySql cdc connector could get stuck in backfill binlog reading after a 
> failover within snapshot phase
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39207
>                 URL: https://issues.apache.org/jira/browse/FLINK-39207
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1, cdc-3.3.0, cdc-3.2.1, 
> cdc-3.4.0, cdc-3.5.0
>            Reporter: Cong Cheng
>            Assignee: Cong Cheng
>            Priority: Major
>
> h2. Summary
> When a `MySqlSourceReader` processes multiple snapshot splits sequentially 
> using the same `SnapshotSplitReader` instance (typically after a failover in 
> snapshot phase), the reader gets stuck and hangs indefinitely (stops emitting 
> records).
> h2. Root Cause Analysis
>  # When a snapshot split is large or network conditions are poor, the split 
> takes longer to finish. While the current snapshot split is being processed, 
> a task failover (caused by a machine going down, a lost connection to the 
> server, etc.) can happen. The task then tries to recover with the unfinished 
> snapshot split;
>  # After the recovery, the `MySqlSourceReader` sends another split 
> request because the number of currently assigned splits is less than 1 (this 
> behaviour was introduced by [this pull 
> request|https://github.com/apache/flink-cdc/pull/1927] ), so the 
> `MySqlSourceReader` can be recovered with two snapshot splits;
>  # When the first split finishes, `stopCurrentTask()` is called, which 
> invokes `changeEventSourceContext.stopChangeEventSource()`. This sets the 
> internal isRunning flag of the context to false;
>  # When `submitSplit()` is called for the next split, it reuses the same 
> SnapshotSplitReader instance and the same changeEventSourceContext. However, 
> the submitSplit() method fails to reset the isRunning flag of the context 
> back to true;
>  # When the backfill task (a MySqlBinlogSplitReadTask) for the new 
> snapshot split starts, it checks context.isRunning(), finds it to be false, 
> and exits immediately without reading any data or sending the BINLOG_END 
> event;
>  # The main thread (pollWithBuffer) waits indefinitely for data or the end 
> signal from the queue, leading to a deadlock.
> h2. Steps to Reproduce
>  # Configure a MySQL CDC Source with scan.incremental.snapshot.chunk.size set 
> to a large value so that each snapshot split takes a long time to read;
>  # Trigger a TaskManager failover while the job is in the snapshot phase;
>  # Observe that the job hangs after processing the first split.
>  
>  
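
The failure sequence in steps 3 to 6 of the root cause analysis can be sketched as a small standalone Java simulation. This is only an illustration under stated assumptions, not the connector's code: `StaleContextSketch`, `Context`, `Reader`, `secondSplitCompletes` and the `resetOnSubmit` switch are hypothetical names, the backfill runs synchronously inside `submitSplit()` instead of on a separate thread, and a poll timeout stands in for the indefinite wait so that the program terminates. Resetting the flag on submit is one possible remedy implied by the analysis, not necessarily the fix that will be merged.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class StaleContextSketch {

    static final String END = "BINLOG_END";

    /** Hypothetical stand-in for the shared change event source context. */
    static final class Context {
        private volatile boolean running = true;
        boolean isRunning() { return running; }
        void stop() { running = false; }
        void reset() { running = true; }
    }

    /** Hypothetical reader that reuses one context for every split. */
    static final class Reader {
        private final Context context = new Context();
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final boolean resetOnSubmit;

        Reader(boolean resetOnSubmit) { this.resetOnSubmit = resetOnSubmit; }

        void submitSplit(String split) {
            if (resetOnSubmit) {
                context.reset(); // assumed remedy: re-arm the flag per split
            }
            // Simulated backfill task: it only produces output, including
            // the end marker, while the context reports that it is running.
            if (context.isRunning()) {
                queue.add(split + ":data");
                queue.add(END);
            }
        }

        /** Returns true if the end marker arrived before the timeout. */
        boolean pollUntilEnd(long timeoutMillis) {
            try {
                while (true) {
                    String event = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                    if (event == null) {
                        return false; // the real reader would keep waiting here
                    }
                    if (END.equals(event)) {
                        context.stop(); // mirrors stopCurrentTask() after a split
                        return true;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Processes two splits in a row and reports whether the second one ends. */
    public static boolean secondSplitCompletes(boolean resetOnSubmit) {
        Reader reader = new Reader(resetOnSubmit);
        reader.submitSplit("split-0");
        reader.pollUntilEnd(100);
        reader.submitSplit("split-1");
        return reader.pollUntilEnd(100);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + secondSplitCompletes(false));
        System.out.println("with reset: " + secondSplitCompletes(true));
    }
}
```

In this sketch the second split never delivers its end marker unless the flag is re-armed, which corresponds to the hang described in the issue.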



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
