Adamyuanyuan opened a new pull request, #10208:
URL: https://github.com/apache/seatunnel/pull/10208

   
   ### Description of error reporting scenarios
   1. The job started normally on the first TaskManager:
      - `JdbcSourceSplitEnumerator` split the table into 1 split:
        - `Splitting table xxx.xxxx.`
        - `Split table xxx.xxxx into 1 splits.`
      - Then it sent `NoMoreSplitsEvent`:
        - `No more splits to assign. Sending NoMoreSplitsEvent to reader [0].`
   2. After running for a while, the TaskManager heartbeat timed out:
      - `Heartbeat of TaskManager ... timed out.`
      - The operator went from `RUNNING` to `FAILED`, and the job went from 
`RUNNING` to `RESTARTING`.
   3. Flink recovered the job on a new TaskManager container:
      - `Recovering subtask 0 to checkpoint -1 for source Source: Jdbc-Source 
to checkpoint.`
      - `Adding splits back from subtask: 0, splits count: 1`
      - `Add back splits 1 to JdbcSourceSplitEnumerator.`
      - The recovered reader re-registered and obtained the split:
        - `Register reader 0 to JdbcSourceSplitEnumerator.`
        - `Assigning 1 splits to subtask: 0`
   4. From the second start until it was manually canceled, the following 
**never appeared**:
      - `No more splits to assign. Sending NoMoreSplitsEvent ...`
      - No exception stack traces from operators either.
   5. The job was finally ended by an external cancel:
      - Job status: `RUNNING` → `CANCELLING` → `CANCELED`
      - Final YARN status: `KILLED` (diagnosis: `null`)
      - SeaTunnel reported: `Reason:Flink job executed failed` (because the 
Flink Job was CANCELED)
   
   ### Root cause
   `NoMoreSplitsEvent` is a terminal signal for bounded sources. After 
failover, a new reader instance re-registers, but the enumerator/coordinator 
component may not re-run `run()` (or may have already finished sending 
no-more). As a result, the recovered reader never receives `NoMoreSplitsEvent` 
and waits indefinitely.
   
   ### What changed in this PR
   - **Framework (Flink translation)**: record which subtasks have been 
signaled `signalNoMoreSplits(subtaskId)`, and when a reader re-registers, 
re-signal no-more to that subtask.
   - **Framework (SeaTunnel Engine)**: same idea in `SourceSplitEnumeratorTask` 
/ `SeaTunnelSplitEnumeratorContext` to ensure re-registered readers get the 
terminal signal.
   - **JDBC**: keep the enumerator semantics simple (signal no-more once at the 
end of `run()`), and add/extend unit tests (including a high-parallelism case 
where some readers receive no split but still must receive no-more).
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   - `FlinkSourceEnumeratorTest`
   - `SourceSplitEnumeratorTaskTest`
   - `JdbcSourceSplitEnumeratorTest`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to