nickva opened a new pull request, #5152:
URL: https://github.com/apache/couchdb/pull/5152
Previously, if the coordinator process was killed too quickly, before the
stream worker cleanup process was spawned, remote workers could be left around
waiting until the default 5 minute timeout expired.
To reliably clean up processes in that state, we need to start the
cleaner process, with all the job references, before we start submitting the
jobs for execution.
At first, it may seem impossible to monitor a process before it has been
spawned. That's true for regular processes; however, rexi operates on
plain references. For each process we spawn remotely we create a reference on
the coordinator side, which we can then use to track that job. Those are just
plain manually created references. Nothing stops us from creating them first,
adding them to a cleaner process, and only then submitting them.
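
The ordering described above can be modelled in a few lines. This is only a sketch in Python, not the Erlang code from this PR: `make_ref`, `Cleaner`, `submit_jobs`, and the `crash_after` knob are hypothetical names invented for illustration, and the model assumes that asking to kill a job that was never submitted is harmless.

```python
import itertools

_counter = itertools.count(1)


def make_ref():
    """Stand-in for an Erlang reference: a unique, opaque token."""
    return ("ref", next(_counter))


class Cleaner:
    """Hypothetical cleanup process that knows every (node, ref) up front."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.killed = []

    def coordinator_down(self):
        # Models the reaction to the coordinator exiting: kill every known
        # job, whether or not it was actually submitted (assumed harmless).
        self.killed.extend(self.jobs)


def submit_jobs(nodes, send, crash_after=None):
    jobs = [(node, make_ref()) for node in nodes]  # 1. create references
    cleaner = Cleaner(jobs)                        # 2. start the cleaner
    try:
        for i, (node, ref) in enumerate(jobs):     # 3. only now submit
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("coordinator killed")
            send(node, ref)
    except RuntimeError:
        cleaner.coordinator_down()
    return jobs, cleaner
```

Because the cleaner holds every reference before the first job is sent, a coordinator death at any point during submission still leaves the cleaner with a complete list of jobs to kill.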
That's exactly what this commit accomplishes:
* Create a streams-specific `fabric_streams:submit_jobs/4` function, which
spawns the cleanup process early, generates worker references, and then submits
the jobs. This way, all the existing streaming `submit_jobs` calls can be
replaced easily with a one-line change: fabric_util -> fabric_streams.
* The cleanup process operates as before: it monitors the coordinator for
exits, and fires off a `kill_all` message to each node.
* Create `rexi:cast_ref(...)` variants of `rexi:cast(...)` calls, where
the caller specifies the reference as a new argument. This is what allows us to
start the cleanup process before the jobs even get submitted. Older calls can
simply call into the `cast_ref` versions with their own newly created references.
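
The relationship between the older calls and the new `cast_ref` variants can be sketched as follows. This is a Python model rather than the Erlang code in this PR. The real argument lists are elided above as `(...)`, so the parameter order and the `outbox` list standing in for message delivery are assumptions made for illustration.

```python
import itertools

_next_id = itertools.count(1)


def make_ref():
    """Stand-in for an Erlang reference: a unique, opaque token."""
    return ("ref", next(_next_id))


def cast_ref(ref, node, mfa, outbox):
    """New-style call: the caller supplies the reference (hypothetical signature)."""
    outbox.append((node, ref, mfa))  # models sending the job to the node
    return ref


def cast(node, mfa, outbox):
    """Older-style call: creates its own reference, then delegates to cast_ref."""
    return cast_ref(make_ref(), node, mfa, outbox)
```

With this shape, existing callers keep their behavior, while streaming code can create references ahead of time and pass them in.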
Since we added the new `rexi:cast_ref(...)` variants, make sure to add more
test coverage, including for the streaming logic. It's not 100% yet, but
getting there.
Also, the comments in `rexi.erl` were full of erldoc stanzas and we don't
actually build erldocs anywhere, so replace them with something more helpful.
The streaming protocol itself was never quite described anywhere, and it can
take some time to figure out (at least it took me a while), so I took the
chance to also add a very basic, high-level description of the message flow.
Related:
https://github.com/apache/couchdb/issues/5127#issuecomment-2253261222