Hi,

On 2019-11-04 12:14:53 -0500, Robert Haas wrote:
> If a process trying to register workers finds out that no worker slots
> are available, it discovers this at the time it tries to perform the
> registration. But fork() failure happens later and in a different
> process. The original process just finds out that the worker is
> "stopped," not whether or not it ever got started in the first
> place.

Is that really true? In the case where it started and failed we expect
the error queue to have been attached to, and there to be either an
error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't
strike me as very complicated to keep track of whether any worker has
sent an 'E' or not, no? I don't think we really need the

Funny (?) anecdote: I learned about this part of the system recently,
after I had installed some crash handler inside postgres. Turns out that
that diverted, as a side-effect, SIGUSR1 to its own signal handler. All
tests in the main regression tests passed, except for ones getting stuck
waiting for WaitForParallelWorkersToFinish(), which could be fixed by
disabling parallelism aggressively. Took me like two hours to debug...

Also, a bit sad that parallel query is the only visible failure (in the
main tests) of breaking the sigusr1 infrastructure...

> We certainly can't ignore a worker that managed to start and
> then bombed out, because it might've already, for example, claimed a
> block from a Parallel Seq Scan and not yet sent back the corresponding
> tuples. We could ignore a worker that never started at all, due to
> EAGAIN or whatever else, but the original process that registered the
> worker has no way of finding this out.

Sure, but in that case we'd have gotten either an error back from the
worker, or postmaster would have PANIC restarted everyone due to an
unhandled error in the worker, no?

> And even if you solved for all of that, I think you might still find
> that it breaks some parallel query (or parallel create index) code
> that expects the number of workers to change at registration time, but
> not afterwards. So, that could would all need to be adjusted.

Fair enough. Although I think practically nearly everything has to be
ready to handle workers just being slow to start up anyway, no? There
are plenty of cases where we just finish before all workers get around
to doing any work.

Greetings,

Andres Freund