yzeng1618 opened a new issue, #10581: URL: https://github.com/apache/seatunnel/issues/10581
### Search before asking

- [x] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

## Background & Motivation

In large multi-table synchronization jobs, such as whole-database sync with 1000 tables, SeaTunnel currently behaves in a fail-fast way at several stages. If a small number of tables fail during source table discovery, sink initialization, save mode handling, or runtime writing, the whole job may fail immediately. This is operationally expensive for large sync jobs because a few abnormal tables block the remaining healthy tables.

This issue proposes a framework-level capability for multi-table jobs:

- isolate failed tables when the error is clearly scoped to a single table
- allow healthy tables to continue
- print a structured failed-table summary with table name, failure phase, plugin, and root cause
- keep the current fail-fast behavior as the default for backward compatibility

## Goals

- Support table-level fault isolation in multi-table jobs
- Print failed-table details in logs and in the final exception summary
- Keep healthy tables running when the failure is table-scoped
- Keep `FAIL_FAST` as the default behavior
- Do not introduce a new `JobStatus` such as `PARTIAL`

## Non-goals

- Do not bypass shared/system-level failures such as source connection loss, checkpoint coordinator failure, cluster instability, OOM, or plugin loading failure
- Do not change existing behavior for jobs that do not enable this capability

## Proposal

### 1. Failure policy

Introduce an opt-in policy for multi-table jobs:

`multi_table.failure_policy = FAIL_FAST | CONTINUE_OTHER_TABLES`

The default remains `FAIL_FAST`.
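As an illustration, the policy could be set in the job's `env` block. Note that `multi_table.failure_policy` is the option proposed here, not an existing SeaTunnel option, and its exact placement is open for discussion:

```hocon
env {
  # Proposed option (not yet implemented): when a failure is clearly
  # scoped to a single table, quarantine that table and keep the
  # remaining healthy tables running instead of failing the whole job.
  multi_table.failure_policy = "CONTINUE_OTHER_TABLES"
}
```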
### 2. Failed table summary

When a table is isolated, SeaTunnel should record and print:

- `table_path`
- `phase` (`discovery`, `source_init`, `sink_init`, `save_mode`, `runtime_write`, `checkpoint`)
- `plugin_name`
- `exception_class`
- `message_summary`
- `first_failure_time`

### 3. Runtime semantics

- Batch: if at least one table remains healthy, continue processing the healthy tables; if any table failed, the final job result is still `FAILED` after the batch completes.
- Streaming: while healthy tables still have active work, keep the job `RUNNING`; once the job exits because no healthy tables remain or the job is stopped, the failed-table summary should be visible in the final failure information.
- Shared failures remain fail-fast.

### 4. Implementation roadmap

1. **Startup phase isolation.** Catch and aggregate table-scoped exceptions during source table discovery, `getProducedCatalogTables()`, sink creation, and save mode handling. Skip only the failed tables and continue building the job if healthy tables remain.
2. **Runtime sink isolation.** Change multi-table sub-writer handling from global fail-fast to per-table quarantine. After one table's writer fails, stop routing new records to that table, exclude it from subsequent checkpoint/commit handling, and keep the other tables running.
3. **Source-side attribution and observability.** For source errors that can be clearly attributed to a table or split, isolate the affected table. Print failed-table details at first failure and print an aggregated summary at the end.
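A minimal sketch of the failed-table summary described above. The names `FailedTableRegistry` and `FailedTable` are hypothetical and not part of the current codebase; this only shows the shape of the recorded fields and the aggregated summary:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: aggregates table-scoped failures so healthy tables
// can continue while a structured summary is printed at job end.
class FailedTableRegistry {
    record FailedTable(String tablePath, String phase, String pluginName,
                       String exceptionClass, String messageSummary,
                       Instant firstFailureTime) {}

    private final List<FailedTable> failures = new ArrayList<>();

    /** Records the first failure of a table; later failures of the same table are ignored. */
    synchronized void record(String tablePath, String phase, String pluginName, Throwable cause) {
        if (!isFailed(tablePath)) {
            failures.add(new FailedTable(tablePath, phase, pluginName,
                    cause.getClass().getName(), String.valueOf(cause.getMessage()), Instant.now()));
        }
    }

    synchronized boolean isFailed(String tablePath) {
        return failures.stream().anyMatch(f -> f.tablePath().equals(tablePath));
    }

    /** Builds the aggregated summary for logs and the final failure message. */
    synchronized String summary() {
        StringBuilder sb = new StringBuilder("Failed tables (" + failures.size() + "):\n");
        for (FailedTable f : failures) {
            sb.append(String.format("  %s | phase=%s | plugin=%s | %s: %s%n",
                    f.tablePath(), f.phase(), f.pluginName(), f.exceptionClass(), f.messageSummary()));
        }
        return sb.toString();
    }

    synchronized List<FailedTable> failures() {
        return Collections.unmodifiableList(failures);
    }
}
```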
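The per-table quarantine in roadmap step 2 could look roughly like the following dispatch loop. `QuarantiningMultiTableWriter` and `TableWriter` are simplified stand-ins for the real multi-table sink writer, not existing SeaTunnel classes:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-table quarantine in a multi-table sink writer.
// A failure in one table's writer quarantines only that table; records for
// other tables keep flowing, and quarantined tables are excluded from
// subsequent checkpoint/commit handling.
class QuarantiningMultiTableWriter {
    interface TableWriter { void write(String record) throws Exception; }

    private final Map<String, TableWriter> writers = new HashMap<>();
    private final Set<String> quarantined = new HashSet<>();

    void register(String tablePath, TableWriter writer) { writers.put(tablePath, writer); }

    /** Routes a record to its table writer; quarantines the table on failure instead of failing the job. */
    void write(String tablePath, String record) {
        if (quarantined.contains(tablePath)) {
            return; // stop routing records to a quarantined table
        }
        try {
            writers.get(tablePath).write(record);
        } catch (Exception e) {
            quarantined.add(tablePath); // isolate this table only
        }
    }

    /** Tables that should still participate in checkpoint/commit handling. */
    Set<String> healthyTables() {
        Set<String> healthy = new HashSet<>(writers.keySet());
        healthy.removeAll(quarantined);
        return healthy;
    }

    boolean anyHealthy() { return !healthyTables().isEmpty(); }
}
```

In the real implementation the catch block would also report the failure (for example to the failed-table summary) rather than silently quarantining, and shared failures such as a lost sink connection would still propagate and fail the job.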
### 5. Planning diagram

```mermaid
flowchart TD
    A[Source/Sink table discovery and init] -->|healthy table| B[Build runnable table set]
    A -->|table-scoped failure| C[FailedTableRegistry]
    B --> D[Runtime read/write]
    D -->|healthy table| E[Continue processing]
    D -->|table-scoped runtime failure| C
    C --> F[Print failed-table summary and aggregate root causes]
    E --> G{Any healthy table left?}
    G -- Yes --> H[Batch continues / Streaming stays RUNNING]
    G -- No --> I[Job ends as FAILED]
```

### 6. Acceptance criteria

- With `CONTINUE_OTHER_TABLES` enabled, failures isolated to a subset of tables no longer terminate healthy tables immediately.
- Logs and the final error message include the failed-table list and the per-table cause.
- `FAIL_FAST` keeps the current behavior unchanged.
- Shared/system-level failures still fail the whole job.

### Usage Scenario

- A multi-table JDBC or CDC job synchronizes around 1000 tables.
- A few tables may fail because of unsupported types, missing privileges, target DDL conflicts, or sink-side constraint errors.
- Operators want healthy tables to continue instead of rerunning the whole job.
- Operators also need a clear failed-table list for later fixing and replay.

### Related issues

- `#10196`: row-level bypass/DLQ for transform and sink. This issue is complementary and focuses on table-level fault isolation in multi-table jobs.
- `#4193`: a historical multi-table parser issue; this proposal addresses a broader fault-isolation problem in multi-table execution.

### Are you willing to submit a PR?

- [x] Yes, I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
