davidzollo opened a new issue, #11049: URL: https://github.com/apache/seatunnel/issues/11049
## Background

SeaTunnel currently does not provide a native OceanBase CDC source connector in the `connector-cdc` family. That leaves OceanBase users without a first-class path for:

- initial snapshot + continuous incremental capture
- checkpoint/restart-safe CDC ingestion
- multi-table CDC jobs using the same runtime model as other SeaTunnel CDC sources
- downstream schema evolution / multi-table sink integration based on SeaTunnel's CDC row model

An old historical discussion exists, but it is stale and no longer gives contributors a practical implementation target. This issue is intended to replace that with a claimable engineering scope.

## Scope

Add a new `connector-cdc-oceanbase` source connector under `seatunnel-connectors-v2/connector-cdc`.

This issue is for the **source connector only**.

## First delivery boundary

To keep the issue claimable, the first delivery should stay narrow:

- support snapshot + incremental CDC for explicitly configured tables
- integrate with SeaTunnel's existing CDC base abstractions where possible
- support checkpoint / restore correctness
- support the normal SeaTunnel multi-table CDC row contract

If different OceanBase deployment modes require materially different CDC backends, the first delivery should target the path that is stable and testable in CI, and explicitly defer additional modes to follow-up work instead of blocking the connector.

## Suggested implementation approach

### 1. Choose and isolate the capture backend

The implementation should start by deciding the CDC capture backend and keeping that decision isolated inside the OceanBase connector module.

Practical options may include:

- an OceanBase-native CDC/log-proxy client path, or
- a compatible incremental-source path if the target OceanBase deployment exposes a stable change-log interface suitable for SeaTunnel's CDC model.

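
One way to picture "isolated inside the connector module" is a single narrow interface that either backend option would implement. The sketch below is illustrative only: `ChangeEvent`, `CaptureBackend`, `InMemoryCaptureBackend`, and the numeric `position` field are hypothetical names invented for this example, not SeaTunnel or OceanBase APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical change record; not a SeaTunnel or OceanBase type. */
final class ChangeEvent {
    final String table;
    final String op;       // "INSERT", "UPDATE" or "DELETE"
    final long position;   // placeholder for whatever position the backend exposes

    ChangeEvent(String table, String op, long position) {
        this.table = table;
        this.op = op;
        this.position = position;
    }
}

/** Hypothetical seam: the only backend surface the rest of the connector would see. */
interface CaptureBackend {
    /** Delivers every event positioned strictly after {@code fromPosition} to the sink. */
    void start(long fromPosition, Consumer<ChangeEvent> sink);
}

/** In-memory stand-in so reader logic can be tested without an OceanBase cluster. */
final class InMemoryCaptureBackend implements CaptureBackend {
    private final List<ChangeEvent> log;

    InMemoryCaptureBackend(List<ChangeEvent> log) {
        this.log = new ArrayList<>(log);
    }

    @Override
    public void start(long fromPosition, Consumer<ChangeEvent> sink) {
        for (ChangeEvent event : log) {
            if (event.position > fromPosition) {
                sink.accept(event);
            }
        }
    }
}

final class CaptureBackendDemo {
    /** Backend-agnostic code: it depends on the interface only. */
    static List<String> replay(CaptureBackend backend, long fromPosition) {
        List<String> out = new ArrayList<>();
        backend.start(fromPosition, e -> out.add(e.position + ":" + e.op + ":" + e.table));
        return out;
    }

    public static void main(String[] args) {
        CaptureBackend backend = new InMemoryCaptureBackend(Arrays.asList(
                new ChangeEvent("db.t1", "INSERT", 1L),
                new ChangeEvent("db.t2", "DELETE", 2L)));
        System.out.println(replay(backend, 0L));
    }
}
```

With a seam like this, swapping a log-proxy client for another incremental source would only touch the class that implements the interface, and an in-memory implementation can back unit tests when CI cannot host a cluster.
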
Whichever backend is chosen, the connector should not leak backend-specific assumptions into unrelated generic CDC code unless a reusable abstraction is clearly justified.

### 2. Follow the existing CDC connector module layout

The new module should be structured similarly to existing SeaTunnel CDC connectors and include at least:

- source options
- source config / source config factory
- dialect or connector-specific source adapter
- offset representation / offset factory
- snapshot split planning if snapshot is supported incrementally
- fetch task context / incremental reader integration
- connector docs and plugin metadata registration

Expected repository touch points include:

- `seatunnel-connectors-v2/connector-cdc/connector-cdc-oceanbase`
- `seatunnel-connectors-v2/connector-cdc/pom.xml`
- `plugin-mapping.properties`
- `seatunnel-dist/pom.xml`
- `config/plugin_config`
- `docs/en` and `docs/zh`
- `seatunnel-e2e`

### 3. Keep startup semantics explicit

The connector should expose SeaTunnel-owned startup semantics instead of requiring users to infer behavior through low-level passthrough properties.

A reasonable first delivery is:

- `initial`: read snapshot, then continue with incremental CDC
- `latest` or equivalent incremental-only startup if the backend supports it safely

If additional startup modes are not yet reliable for OceanBase, they should be omitted from the first delivery rather than partially implemented.

### 4. Preserve SeaTunnel CDC row semantics

The connector should emit rows that fit SeaTunnel's CDC runtime expectations:

- correct table identity for multi-table jobs
- row kind semantics aligned with insert/update/delete handling
- existing metadata population where applicable, such as database/table identifiers and CDC timing fields already used by current connectors

### 5. Checkpoint / restore correctness is mandatory

The connector should not be considered complete if it only starts successfully but cannot resume safely.

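
As a rough illustration of what resuming safely implies for the offset type, the sketch below round-trips an offset through Java serialization and uses its ordering to decide which events a restored reader should skip. `SketchOffset` and its `timestamp` / `sequence` fields are placeholders assumed for this example; the real position format depends on the chosen OceanBase backend, and SeaTunnel's own offset base classes are not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Hypothetical offset; the two fields are placeholders, not OceanBase's real position format. */
final class SketchOffset implements Serializable, Comparable<SketchOffset> {
    private static final long serialVersionUID = 1L;

    final long timestamp; // placeholder: coarse position in the change log
    final long sequence;  // placeholder: tie-breaker within one timestamp

    SketchOffset(long timestamp, long sequence) {
        this.timestamp = timestamp;
        this.sequence = sequence;
    }

    @Override
    public int compareTo(SketchOffset other) {
        int byTime = Long.compare(timestamp, other.timestamp);
        return byTime != 0 ? byTime : Long.compare(sequence, other.sequence);
    }

    /** True when an event at this offset was already covered by {@code checkpoint}. */
    boolean isAtOrBefore(SketchOffset checkpoint) {
        return compareTo(checkpoint) <= 0;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchOffset)) {
            return false;
        }
        SketchOffset that = (SketchOffset) o;
        return timestamp == that.timestamp && sequence == that.sequence;
    }

    @Override
    public int hashCode() {
        return Objects.hash(timestamp, sequence);
    }
}

final class OffsetRoundTrip {
    /** Serializes the offset the way a checkpoint snapshot might store it. */
    static byte[] serialize(SketchOffset offset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Restores the offset from checkpoint bytes. */
    static SketchOffset deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SketchOffset) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchOffset checkpoint = new SketchOffset(1700000000L, 7L);
        SketchOffset restored = deserialize(serialize(checkpoint));
        System.out.println("round trip equal: " + checkpoint.equals(restored));
    }
}
```

A test along these lines only covers serializability and ordering. Whether a restart skips or duplicates events still has to be checked against the real backend.
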
Implementation must verify that:

- offsets/checkpoints are serializable
- restore resumes from a stable OceanBase CDC position
- restart does not silently skip or duplicate incremental events beyond documented guarantees

### 6. Tests and validation

This issue needs more than unit tests.

Suggested test layers:

- option parsing / validation tests
- offset serialization / restore tests
- source behavior tests for snapshot + incremental flow
- at least one runnable integration or E2E path

If CI cannot host a full OceanBase cluster easily, the issue body or implementation notes should document the chosen validation strategy explicitly instead of silently skipping end-to-end verification.

## Suggested acceptance criteria

- A new `connector-cdc-oceanbase` module is added.
- The connector can read snapshot data and continue with incremental CDC for configured tables.
- The connector integrates with SeaTunnel checkpoint / restore semantics correctly.
- Multi-table capture is supported for explicitly configured tables.
- English and Chinese docs are added.
- Plugin registration and distribution packaging are updated.
- At least one focused integration/E2E validation path is provided.

## Non-goals

- OceanBase sink connector work.
- Dynamic newly-added table discovery.
- New global CDC metadata fields.
- Broad CDC framework refactors not required for OceanBase support.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
