[PR] Improve worker cleanup on early coordinator exit [couchdb]

via GitHub Fri, 26 Jul 2024 12:09:05 -0700


nickva opened a new pull request, #5152:
URL: https://github.com/apache/couchdb/pull/5152


   Previously, if the coordinator process was killed too quickly, before the 
stream worker cleanup process was spawned, remote workers could be left around 
waiting until the default 5 minute timeout expired.
   
   In order to reliably clean up processes in that state, we need to start the 
cleaner process, with all the job references, before we start submitting the 
jobs for execution.
   
   At first, it may seem impossible to monitor a process before it has been 
spawned. That's true for regular processes; rexi, however, operates on 
plain references. For each process we spawn remotely we create a reference on 
the coordinator side, which we can then use to track that job. Those are just 
plain manually created references. Nothing stops us from creating them first, 
adding them to a cleaner process, and only then submitting them.
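
   The ordering described above can be sketched in a few lines of Erlang. This is only an illustration under stated assumptions, not the code in this PR: the module name, the message shapes (`submit`, `kill_all`, `tracked`) and the local process standing in for a remote node are all hypothetical. It only shows that references can be created and handed to a cleaner before any job is submitted.

```erlang
-module(early_cleanup_sketch).
-export([demo/0]).

%% Hypothetical sketch only: none of these names exist in rexi or fabric.
%% A single local process stands in for a remote node.

%% Stand-in node: tracks one worker per submitted reference.
node_loop(Jobs) ->
    receive
        {submit, Ref} ->
            Pid = spawn(fun() -> receive stop -> ok end end),
            node_loop(Jobs#{Ref => Pid});
        {kill_all, Refs} ->
            [exit(Pid, kill) || Pid <- maps:values(maps:with(Refs, Jobs))],
            node_loop(maps:without(Refs, Jobs));
        {tracked, From} ->
            From ! {tracked, maps:size(Jobs)},
            node_loop(Jobs)
    end.

%% Cleaner: knows every reference up front and waits for the coordinator to exit.
cleaner(Coordinator, Node, Refs) ->
    MRef = monitor(process, Coordinator),
    receive
        {'DOWN', MRef, process, _Pid, _Reason} ->
            Node ! {kill_all, Refs}
    end.

coordinator(Node, N, Parent) ->
    %% 1. Create the job references first; they are plain references.
    Refs = [make_ref() || _ <- lists:seq(1, N)],
    %% 2. Start the cleaner with all references before any job is submitted.
    Self = self(),
    spawn(fun() -> cleaner(Self, Node, Refs) end),
    %% 3. Only then submit the jobs, each tagged with its reference.
    [Node ! {submit, Ref} || Ref <- Refs],
    Parent ! submitted,
    receive stop -> ok end.

%% Ask the stand-in node how many jobs it still tracks.
tracked(Node) ->
    Node ! {tracked, self()},
    receive {tracked, Count} -> Count end.

%% Poll until the node no longer tracks any job.
wait_for_cleanup(Node) ->
    case tracked(Node) of
        0 -> ok;
        _ -> timer:sleep(10), wait_for_cleanup(Node)
    end.

demo() ->
    Node = spawn(fun() -> node_loop(#{}) end),
    Parent = self(),
    Coordinator = spawn(fun() -> coordinator(Node, 3, Parent) end),
    receive submitted -> ok end,
    3 = tracked(Node),
    %% Kill the coordinator abruptly, as in the early-exit case.
    exit(Coordinator, kill),
    wait_for_cleanup(Node).
```

   In this sketch the cleaner already holds every reference by the time the first job is submitted, so killing the coordinator at any later point still results in a `kill_all` for all of them.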
   
   That's exactly what this commit accomplishes:
   
     * Create a streams-specific `fabric_streams:submit_jobs/4` function, which 
spawns the cleanup process early, generates worker references, and then submits 
the jobs. This way, all the existing streaming `submit_jobs` calls can be 
replaced easily in one line: `fabric_util` -> `fabric_streams`.
   
     * The cleanup process operates as before: it monitors the coordinator for 
exits, and fires off a `kill_all` message to each node.
   
     * Create `rexi:cast_ref(...)` variants of `rexi:cast(...)` calls, where 
the caller specifies the reference as a new argument. This is what allows us to 
start the cleanup process before the jobs even get submitted. Older calls can 
simply call into the `cast_ref` versions with their own created references.
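
   The relationship between the two call styles can be sketched roughly as follows. The module name, arities and message shape here are hypothetical and much simpler than the real `rexi:cast(...)` and `rexi:cast_ref(...)` functions. The sketch only illustrates the older-style call creating its own reference and delegating to the variant that accepts one.

```erlang
-module(cast_ref_sketch).
-export([cast/2, cast_ref/3]).

%% Hypothetical signatures; the real rexi functions take different arguments.

%% Older style: the function creates the reference itself and delegates.
cast(Dest, Msg) ->
    cast_ref(make_ref(), Dest, Msg).

%% Newer style: the caller supplies the reference, so it can be registered
%% with a cleaner process before the job is submitted.
cast_ref(Ref, Dest, Msg) when is_reference(Ref) ->
    Dest ! {job, Ref, Msg},
    Ref.
```

   With this shape, existing callers keep working unchanged, while streaming callers can generate references first and pass them in.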
   
   Since we added the new `rexi:cast_ref(...)` variants, also add more test 
coverage, including for the streaming logic. It's not 100% yet, but it's 
getting there.
   
   Also, the comments in `rexi.erl` were full of erldoc stanzas and we don't 
actually build erldocs anywhere, so replace them with something more helpful. 
The streaming protocol itself was never quite described anywhere, and it can 
take some time to figure out (at least it did for me), so I took the chance to 
also add a very basic, high-level description of the message flow.
   
   Related: 
https://github.com/apache/couchdb/issues/5127#issuecomment-2253261222
   

