Some time back, Tom Lane reported [1] about $Subject. I have looked into the issue and found that the problem is not only with parallel workers but with the general background worker machinery as well, in situations where fork or some similar failure occurs. The first problem is that after we register a dynamic worker, the ways to find out whether the worker has started (WaitForBackgroundWorkerStartup/GetBackgroundWorkerPid) won't give the right answer if a fork failure happens. Also, in cases where the worker is marked not to restart after a crash, the postmaster doesn't notify the backend if it is not able to start the worker, which can make the backend wait forever as it is oblivious to the fact that the worker was never started.

Now, apart from these general problems of the background worker machinery, parallel.c assumes that after it has registered the dynamic workers, they will start and perform their work. We need to ensure that if the postmaster is not able to start parallel workers due to fork failure or any similar situation, the backend doesn't keep waiting forever. To fix it, before waiting for workers to finish, we can check whether the worker exists at all. The attached patch fixes these problems.
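To illustrate the "check whether the worker exists at all" idea, here is a toy model. It is only a sketch and not the code in the patch: the Toy* types and toy_* functions are made up for illustration and are not PostgreSQL structures or APIs, and it assumes the postmaster can record a failed start in the worker's slot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical slot states; these are NOT PostgreSQL definitions. */
typedef enum
{
    TOY_SLOT_REGISTERED,    /* registered, fork not attempted yet */
    TOY_SLOT_RUNNING,       /* fork succeeded, worker is alive */
    TOY_SLOT_FINISHED,      /* worker ran and exited */
    TOY_SLOT_START_FAILED   /* fork failed and no restart will happen */
} ToySlotState;

typedef struct
{
    ToySlotState state;
} ToyWorkerSlot;

/* A worker is worth waiting for only if it exists or may still be started. */
bool
toy_worker_may_finish(const ToyWorkerSlot *slot)
{
    return slot->state == TOY_SLOT_REGISTERED ||
           slot->state == TOY_SLOT_RUNNING;
}

/*
 * Naive count: assumes every registered worker eventually starts and
 * finishes, so anything not yet finished is still waited for.
 */
int
toy_naive_pending(const ToyWorkerSlot *slots, int nworkers)
{
    int pending = 0;

    assert(nworkers >= 0);
    for (int i = 0; i < nworkers; i++)
        if (slots[i].state != TOY_SLOT_FINISHED)
            pending++;
    return pending;
}

/*
 * Checked count: skips workers that were never started, so the leader
 * stops waiting once all workers that actually exist are done.
 */
int
toy_checked_pending(const ToyWorkerSlot *slots, int nworkers)
{
    int pending = 0;

    assert(nworkers >= 0);
    for (int i = 0; i < nworkers; i++)
        if (toy_worker_may_finish(&slots[i]))
            pending++;
    return pending;
}
```

With three slots where one worker has finished, one failed to start, and one is running, the naive count never reaches zero even after the running worker exits, which corresponds to the backend waiting forever. The checked count drops to zero at that point.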
Another way to fix the parallel-query-related problem is for the master backend, after registering the workers, to wait for them to start before setting up the queues they use to communicate. I think that can be quite expensive.

Thoughts?

[1] - https://www.postgresql.org/message-id/4905.1492813...@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
fix_worker_startup_failures_v1.patch