JNSimba opened a new pull request, #64013:
URL: https://github.com/apache/doris/pull/64013
## Proposed changes
Two reliability fixes for from-to (at-least-once) CDC streaming tasks.
1. **Startup timeout.** A from-to binlog task whose upstream is idle could block indefinitely in the replication startup/locate phase: no first message arrives, so the poll loop never times out. This adds a setup-phase timeout, set to half of the FE task timeout and passed down via `WriteRecordRequest.taskTimeoutMs`, so the task exits and commits the current offset gracefully instead of hanging. Snapshot splits are explicitly excluded so an incomplete watermark is never committed.
2. **Release a stale reader on failure.** On task `onFail`/`cancel`, FE makes a best-effort request (`/api/releaseReader`) asking the previous backend to stop its reader while keeping the replication slot, so a reschedule to another backend does not leave two readers competing for the same slot. The RPC is fire-and-forget so it never blocks while the job lock is held.
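
The fire-and-forget release in item 2 could look roughly like the sketch below. It is illustrative only and is not the code in this PR: `ReleaseReaderSketch`, `ReleaseClient`, and `releaseAsync` are hypothetical names, and the real transport behind `/api/releaseReader` is abstracted behind an interface. The point is only that the caller hands the request to an executor and returns at once, and that a failed request is swallowed because the release is best-effort.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class ReleaseReaderSketch {

    /** Hypothetical stand-in for the call that hits /api/releaseReader on a backend. */
    interface ReleaseClient {
        void releaseReader(String backendHost) throws Exception;
    }

    /**
     * Submits the release request to the executor and returns immediately,
     * so a caller holding the job lock is never blocked on the network.
     * Failures are swallowed: the release is best-effort by design.
     */
    static CompletableFuture<Void> releaseAsync(
            ReleaseClient client, String previousBackend, Executor executor) {
        return CompletableFuture.runAsync(() -> {
            try {
                client.releaseReader(previousBackend);
            } catch (Exception e) {
                // Best-effort: a real implementation would log this and move on.
            }
        }, executor);
    }
}
```

A caller would invoke `releaseAsync(...)` from the failure or cancel path and ignore the returned future; it is returned here only so the behaviour can be observed in a test.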
Known limitation: the release is best-effort, so a reschedule may briefly observe "replication slot is active"; this self-heals via task retry or the source-side sender timeout.
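
For reference, the setup-phase timeout from item 1 can be sketched as below. This is a simplified model and not the PR's implementation: `SetupTimeoutSketch`, `setupTimeoutMs`, and `awaitFirstMessage` are hypothetical names, the upstream is modelled as a `BlockingQueue`, and the snapshot exclusion is modelled as an unbounded wait, which is an assumption about how "excluded" is realised. Only the "half of the FE task timeout" rule comes from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class SetupTimeoutSketch {

    /** Setup budget: half of the FE task timeout, per the description above. */
    static long setupTimeoutMs(long taskTimeoutMs) {
        return taskTimeoutMs / 2;
    }

    /**
     * Waits for the first upstream message.
     * Returns null when a binlog task's setup budget elapses with no message,
     * signalling the caller to exit and commit the current offset.
     * Snapshot splits get no deadline (assumption), so a partially read
     * split is never treated as complete.
     */
    static <T> T awaitFirstMessage(
            BlockingQueue<T> source, long taskTimeoutMs, boolean isSnapshotSplit) {
        try {
            if (isSnapshotSplit) {
                return source.take(); // excluded from the setup timeout
            }
            return source.poll(setupTimeoutMs(taskTimeoutMs), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }
}
```

With a task timeout of 60000 ms, an idle binlog task in this model gives up after 30000 ms, leaving the remaining half of the budget for a graceful exit.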
## Further comments
Scoped to the from-to streaming path; snapshot and TVF paths are unaffected.