yzeng1618 opened a new issue, #10581:
URL: https://github.com/apache/seatunnel/issues/10581

   ### Search before asking
   
   - [x] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   ## Background & Motivation
   
   In large multi-table synchronization jobs, such as whole-database sync with 
1000 tables, SeaTunnel currently behaves in a fail-fast way at several stages. 
If a small number of tables fail during source table discovery, sink 
initialization, save mode handling, or runtime writing, the whole job may fail 
immediately.
   
   This is operationally expensive for large sync jobs because a few abnormal 
tables block the remaining healthy tables.
   
   This issue proposes a framework-level capability for multi-table jobs:
   
   - isolate failed tables when the error is clearly scoped to a single table
   - allow healthy tables to continue
   - print a structured failed-table summary with table name, failure phase, 
plugin, and root cause
   - keep current fail-fast behavior as the default for backward compatibility
   
   ## Goals
   
   - Support table-level fault isolation in multi-table jobs
   - Print failed table details in logs and final exception summary
   - Keep healthy tables running when the failure is table-scoped
   - Keep `FAIL_FAST` as the default behavior
   - Do not introduce a new `JobStatus` such as `PARTIAL`
   
   ## Non-goals
   
   - Do not bypass shared/system-level failures such as source connection loss, 
checkpoint coordinator failure, cluster instability, OOM, or plugin loading 
failure
   - Do not change existing behavior for jobs that do not enable this capability
   
   ## Proposal
   
   ### 1. Failure policy
   
   Introduce an opt-in policy for multi-table jobs:
   
   `multi_table.failure_policy = FAIL_FAST | CONTINUE_OTHER_TABLES`
   
   Default remains `FAIL_FAST`.
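   A sketch of how the option might appear in a job config; the `env` block and surrounding keys are illustrative, and only `multi_table.failure_policy` is what this proposal introduces:
   
   ```hocon
   env {
     job.mode = "BATCH"
     # Proposed option; when omitted, FAIL_FAST keeps today's behavior.
     multi_table.failure_policy = "CONTINUE_OTHER_TABLES"
   }
   ```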
   
   ### 2. Failed table summary
   
   When a table is isolated, SeaTunnel should record and print:
   
   - `table_path`
   - `phase` (`discovery`, `source_init`, `sink_init`, `save_mode`, 
`runtime_write`, `checkpoint`)
   - `plugin_name`
   - `exception_class`
   - `message_summary`
   - `first_failure_time`
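   The fields above map naturally onto a small immutable value type. A sketch in Java; `FailedTableEntry` is a hypothetical name for illustration, not an existing SeaTunnel class:
   
   ```java
   import java.time.Instant;
   
   // Hypothetical per-table failure record mirroring the proposed summary fields.
   public record FailedTableEntry(
           String tablePath,
           String phase,            // discovery, source_init, sink_init, save_mode, runtime_write, checkpoint
           String pluginName,
           String exceptionClass,
           String messageSummary,
           Instant firstFailureTime) {
   
       // One line of the aggregated failed-table summary.
       public String toSummaryLine() {
           return String.format("%s | phase=%s | plugin=%s | %s: %s | first_failure=%s",
                   tablePath, phase, pluginName, exceptionClass, messageSummary, firstFailureTime);
       }
   }
   ```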
   
   ### 3. Runtime semantics
   
   - Batch: if at least one table remains healthy, continue processing the healthy tables; if any table has failed, the final job result is still `FAILED` after the batch completes.
   - Streaming: keep the job `RUNNING` while healthy tables still have active work; once the job exits, either because no healthy tables remain or because it is stopped, the failed-table summary should be visible in the final failure information.
   - Shared failures remain fail-fast.
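   The batch rule above reduces to: finish the healthy tables, then report `FAILED` if anything was isolated. A minimal sketch, with hypothetical class and method names:
   
   ```java
   import java.util.Set;
   
   // Hypothetical final-status rule for batch jobs under CONTINUE_OTHER_TABLES.
   public final class BatchStatusRule {
       public enum JobResult { FINISHED, FAILED }
   
       // No new PARTIAL status is introduced: any isolated table still
       // fails the job, but only after the healthy tables have completed.
       public static JobResult finalResult(Set<String> failedTables) {
           return failedTables.isEmpty() ? JobResult.FINISHED : JobResult.FAILED;
       }
   }
   ```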
   
   ### 4. Implementation roadmap
   
   1. Startup phase isolation  
      Catch and aggregate table-scoped exceptions during source table 
discovery, `getProducedCatalogTables()`, sink creation, and save mode handling. 
Skip only the failed tables and continue building the job if healthy tables 
remain.
   
   2. Runtime sink isolation  
      Change multi-table sub-writer handling from global fail-fast to per-table 
quarantine. After one table writer fails, stop routing new records to that 
table, exclude it from subsequent checkpoint/commit handling, and keep other 
tables running.
   
   3. Source-side attribution and observability  
      For source errors that can be clearly attributed to a table or split, 
isolate the affected table. Print failed-table details at first failure and 
print an aggregated summary at the end.
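   Step 1 above can be sketched as a generic per-table init loop that quarantines failures instead of rethrowing; `initPerTable` and the registry map are placeholders for the real discovery, sink-creation, and save-mode code paths:
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Function;
   
   // Sketch of startup-phase isolation: initialize each table independently,
   // record the ones that throw, and continue only if some table survives.
   public final class StartupIsolation {
   
       public static <T> Map<String, T> initPerTable(
               List<String> tablePaths,
               Function<String, T> init,              // e.g. sink creation or save mode handling
               Map<String, Exception> failedRegistry) {
           Map<String, T> healthy = new LinkedHashMap<>();
           for (String table : tablePaths) {
               try {
                   healthy.put(table, init.apply(table));
               } catch (Exception e) {
                   // Table-scoped failure: register it and keep building the job.
                   failedRegistry.put(table, e);
               }
           }
           if (healthy.isEmpty()) {
               // No healthy table left: fall back to failing the whole job.
               throw new IllegalStateException(
                       "All tables failed during startup: " + failedRegistry.keySet());
           }
           return healthy;
       }
   }
   ```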
   
   ### 5. Planning diagram
   
   ```mermaid
   flowchart TD
       A[Source/Sink table discovery and init] -->|healthy table| B[Build 
runnable table set]
       A -->|table-scoped failure| C[FailedTableRegistry]
       B --> D[Runtime read/write]
       D -->|healthy table| E[Continue processing]
       D -->|table-scoped runtime failure| C
       C --> F[Print failed-table summary and aggregate root causes]
       E --> G{Any healthy table left?}
       G -- Yes --> H[Batch continues / Streaming stays RUNNING]
       G -- No --> I[Job ends as FAILED]
   ```
   
   ### 6. Acceptance criteria
   
   - With `CONTINUE_OTHER_TABLES` enabled, failures isolated to a subset of 
tables no longer terminate healthy tables immediately.
   - Logs and final error message include the failed table list and per-table 
cause.
   - `FAIL_FAST` keeps current behavior unchanged.
   - Shared/system-level failures still fail the whole job.
   
   ### Usage Scenario
   
   - A multi-table JDBC or CDC job synchronizes around 1000 tables.
   - A few tables may fail because of unsupported types, missing privileges, 
target DDL conflicts, or sink-side constraint errors.
   - Operators want healthy tables to continue instead of rerunning the whole 
job.
   - Operators also need a clear failed-table list for later fixing and replay.
   
   ### Related issues
   
   - `#10196`: row-level bypass/DLQ for transform and sink. This issue is 
complementary and focuses on table-level fault isolation in multi-table jobs.
   - `#4193`: historical multi-table parser related issue; this proposal 
addresses a broader fault-isolation problem in multi-table execution.
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
