[ 
https://issues.apache.org/jira/browse/FLINK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cong Cheng updated FLINK-39207:
-------------------------------
    Description: 
h2. Summary

When a `MySqlSourceReader` processes multiple snapshot splits sequentially 
using the same `SnapshotSplitReader` instance (typically after a failover in 
the snapshot phase), the reader gets stuck and hangs indefinitely (it stops 
emitting records).
h2. Root Cause Analysis
 # When a snapshot split is large or network conditions are poor, the split 
takes longer to finish. If a task failover (caused by a machine going down, a 
lost connection to the server, etc.) happens while the split is still being 
processed, the task tries to recover with the unfinished snapshot split;
 # After the recovery, the `MySqlSourceReader` sends another split request 
because the number of currently assigned splits is less than 1 (this behaviour 
was introduced by [this pull request|https://github.com/apache/flink-cdc/pull/1927]), 
so the `MySqlSourceReader` can end up holding two snapshot splits after recovery;
 # When the first split finishes, `stopCurrentTask()` is called, which invokes 
`changeEventSourceContext.stopChangeEventSource()`. This sets the context's 
internal `isRunning` flag to false;
 # When `submitSplit()` is called for the next split, it reuses the same 
`SnapshotSplitReader` instance and the same `changeEventSourceContext`. However, 
`submitSplit()` does not reset the context's `isRunning` flag back to true;
 # When the backfill task (a `MySqlBinlogSplitReadTask`) for the new snapshot 
split starts, it checks `context.isRunning()`, finds it to be false, and exits 
immediately without reading any data or sending the `BINLOG_END` event;
 # The main thread (`pollWithBuffer`) waits indefinitely for data or the end 
signal from the queue, so the reader hangs.
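
The sequence above can be reproduced in isolation with a minimal, self-contained sketch. It is not the Flink CDC code: `Context`, `Reader`, `runTwoSplits` and `startChangeEventSource()` are hypothetical stand-ins, and the poll uses a timeout so that the hang shows up as a `false` result instead of blocking forever. The `resetOnSubmit` switch illustrates one possible fix (resetting the flag in `submitSplit()`), which is an assumption and not necessarily how the issue will be resolved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ReusedContextSketch {

    static final String BINLOG_END = "BINLOG_END";

    /** Hypothetical stand-in for the change event source context. */
    static final class Context {
        private volatile boolean running = true;

        boolean isRunning() { return running; }
        void stopChangeEventSource() { running = false; }
        // Hypothetical reset method, named here only for illustration.
        void startChangeEventSource() { running = true; }
    }

    /** Hypothetical stand-in for a reader instance reused across splits. */
    static final class Reader {
        private final Context context = new Context();
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final boolean resetOnSubmit;

        Reader(boolean resetOnSubmit) { this.resetOnSubmit = resetOnSubmit; }

        void submitSplit(String split) {
            if (resetOnSubmit) {
                context.startChangeEventSource(); // assumed fix: reset the flag
            }
            // Simulated backfill task: exits silently if the context is stopped.
            Thread backfill = new Thread(() -> {
                if (!context.isRunning()) {
                    return; // no data and no BINLOG_END are ever queued
                }
                queue.add(split + ":row");
                queue.add(BINLOG_END);
            });
            backfill.start();
        }

        /** Returns true if BINLOG_END arrived, false if the poll timed out. */
        boolean pollUntilEnd(long timeoutMillis) {
            try {
                while (true) {
                    String event = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                    if (event == null) {
                        return false; // the real reader would wait here forever
                    }
                    if (BINLOG_END.equals(event)) {
                        context.stopChangeEventSource(); // like stopCurrentTask()
                        return true;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Runs two splits through one reader and reports which ones finished. */
    static List<Boolean> runTwoSplits(boolean resetOnSubmit) {
        Reader reader = new Reader(resetOnSubmit);
        List<Boolean> finished = new ArrayList<>();
        for (String split : new String[] {"split-0", "split-1"}) {
            reader.submitSplit(split);
            finished.add(reader.pollUntilEnd(500));
        }
        return finished;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + runTwoSplits(false));
        System.out.println("with reset:    " + runTwoSplits(true));
    }
}
```

Without the reset, the second split never finishes (`[true, false]`); with the reset, both splits finish (`[true, true]`).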

h2. Steps to Reproduce
 # Configure a MySQL CDC source with `scan.incremental.snapshot.chunk.size` set 
to a large value, so that reading a single snapshot split is time-consuming;
 # Trigger a TaskManager failover while the job is in the snapshot phase;
 # Observe that the job hangs after processing the first split.

 

 

  was:
h2. Summary

When a `MySqlSourceReader` processes multiple snapshot splits sequentially 
using the same `SnapshotSplitReader` instance (typically after a failover in 
snapshot phase), the reader gets stuck and hangs indefinitely (stops emitting 
records).
h2. Problem Statement


> MySql cdc connector could get stuck in backfill binlog reading after a 
> failover within snapshot phase
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39207
>                 URL: https://issues.apache.org/jira/browse/FLINK-39207
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1, cdc-3.3.0, cdc-3.2.1, 
> cdc-3.4.0, cdc-3.5.0
>            Reporter: Cong Cheng
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
