JNSimba opened a new pull request, #64013:
URL: https://github.com/apache/doris/pull/64013

   ## Proposed changes
   
   Two reliability fixes for from-to (at-least-once) CDC streaming tasks.
   
   1. **Startup timeout.** A from-to binlog task whose upstream is idle could block
      indefinitely in the replication startup/locate phase (no first message arrives,
      so the poll loop never times out). This adds a setup-phase timeout (half of the
      FE task timeout, passed down via `WriteRecordRequest.taskTimeoutMs`) so the task
      exits and commits the current offset gracefully instead of hanging. Snapshot
      splits are explicitly excluded so an incomplete watermark is never committed.
   
   2. **Release a stale reader on failure.** On task `onFail`/`cancel`, FE makes a
      best-effort request (`/api/releaseReader`) asking the previous backend to stop
      its reader while keeping the replication slot, so a reschedule to another backend
      does not leave two readers competing for the same slot. The RPC is
      fire-and-forget so it never blocks while the job lock is held.
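
The setup-phase timeout in item 1 can be pictured with a minimal Java sketch. It is not the code from this PR: the class `SetupTimeoutSketch`, the `Outcome` enum, the in-memory queue standing in for the upstream, and the 10 ms poll interval are all hypothetical. Only the "half of the task timeout" rule and the snapshot-split exclusion come from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names exist in Doris.
class SetupTimeoutSketch {
    enum Outcome { FIRST_MESSAGE, SETUP_TIMEOUT, INTERRUPTED }

    /** Setup-phase budget: half of the FE task timeout. */
    static long setupTimeoutMs(long taskTimeoutMs) {
        return taskTimeoutMs / 2;
    }

    /**
     * Waits for the first upstream message. A binlog split gives up once the
     * setup budget is spent, so the caller can commit the current offset and exit.
     * A snapshot split is excluded from the timeout and keeps waiting, so an
     * incomplete watermark is never committed.
     */
    static Outcome awaitFirstMessage(BlockingQueue<String> source,
                                     long taskTimeoutMs,
                                     boolean snapshotSplit) {
        long deadlineNanos = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(setupTimeoutMs(taskTimeoutMs));
        try {
            while (true) {
                // Short poll so the deadline is re-checked even when upstream is idle.
                String msg = source.poll(10, TimeUnit.MILLISECONDS);
                if (msg != null) {
                    return Outcome.FIRST_MESSAGE;
                }
                if (!snapshotSplit && System.nanoTime() - deadlineNanos >= 0) {
                    return Outcome.SETUP_TIMEOUT;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Outcome.INTERRUPTED;
        }
    }
}
```

In this sketch the deadline is checked only for non-snapshot splits, which mirrors the exclusion described in item 1; how a snapshot split is bounded in practice is outside what the description states.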
   
   Known limitation: the release is best-effort, so a reschedule may briefly observe
   "replication slot is active"; this self-heals via task retry or the source-side
   sender timeout.
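
A minimal Java sketch of the fire-and-forget, best-effort release described above, under stated assumptions: `ReleaseReaderSketch`, the `transport` callback standing in for the request to `/api/releaseReader`, the single daemon thread, and the 5-second cap are all hypothetical and not taken from this PR. The point it illustrates is that the caller returns immediately and a failed release is swallowed rather than propagated.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names exist in Doris.
class ReleaseReaderSketch {
    // Daemon thread so a stuck release never keeps the process alive.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "release-reader");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the real request to the previous backend (assumption).
    private final Consumer<String> transport;

    ReleaseReaderSketch(Consumer<String> transport) {
        this.transport = transport;
    }

    /**
     * Submits the release and returns at once. The future completes with true
     * on success and false on any failure or timeout; it never completes
     * exceptionally, so callers holding a lock have nothing to handle.
     */
    CompletableFuture<Boolean> releaseReaderAsync(String previousBackend) {
        return CompletableFuture
                .runAsync(() -> transport.accept(previousBackend), executor)
                .orTimeout(5, TimeUnit.SECONDS)      // bound the wait (Java 9+)
                .handle((ignored, err) -> err == null);
    }
}
```

A caller holding the job lock would invoke `releaseReaderAsync(...)` and continue without calling `join()`; the returned future is useful only for logging or tests. Because a `false` result is simply dropped, the next reader may still see the slot as active for a short while, which is the limitation noted above.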
   
   ## Further comments
   
   Scoped to the from-to streaming path; snapshot and TVF paths are unaffected.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

