On Fri, Mar 31, 2017 at 12:32 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>> While doing some benchmarking, I've run into a fairly strange issue with OOM
>> breaking LaunchParallelWorkers() after the restart. What I see happening is
>> this:
>> 1) a query is executed, and at the end of LaunchParallelWorkers we get
>>     nworkers=8 nworkers_launched=8
>> 2) the query does a Hash Aggregate, but ends up eating much more memory due
>> to n_distinct underestimate (see [1] from 2015 for details), and gets killed
>> by OOM
>> 3) the server restarts, the query is executed again, but this time we get in
>> LaunchParallelWorkers
>>     nworkers=8 nworkers_launched=0
>> There's nothing else running on the server, and there definitely should be
>> free parallel workers.
>> 4) The query gets killed again, and on the next execution we get
>>     nworkers=8 nworkers_launched=8
>> again, although not always. I wonder whether the exact impact depends on OOM
>> killing the leader or worker, for example.
> I don't know what's going on but I think I have seen this once or
> twice myself while hacking on test code that crashed.  I wonder if the
> DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
> the DSM control is somehow confused?
I think I've run into the same problem while working on parallelizing
plans containing InitPlans. You can reproduce the scenario with the
following steps:

1. Put an Assert(0) in ParallelQueryMain(), start the server, and
execute any parallel query.
 In LaunchParallelWorkers, you can see
       nworkers = n nworkers_launched = n (n>0)
However, all the workers will crash because of the Assert.
2. The server restarts automatically and initializes
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in shared memory.
After that, it calls ForgetBackgroundWorker, which increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:
if ((BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >=
DO NOT launch any parallel worker.
Hence, nworkers = n nworkers_launched = 0.
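
To illustrate why the check above can refuse every launch after a
restart, here is a minimal standalone sketch. It is not the server
code. It assumes both counters are unsigned 32-bit integers, so that
a terminate count that has moved ahead of a freshly reset register
count makes the difference wrap around to a very large value. The
struct, the function names, and the limit of 8 are hypothetical and
chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in shared memory. */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounters;

/* Assumed worker limit, for illustration only. */
static const uint32_t sketch_max_parallel_workers = 8;

/*
 * Number of parallel workers considered active. With unsigned
 * arithmetic this wraps around when the terminate count exceeds
 * the register count.
 */
uint32_t
active_parallel_workers(const WorkerCounters *c)
{
    return c->parallel_register_count - c->parallel_terminate_count;
}

/* Mirrors the shape of the check described above: refuse at the limit. */
bool
can_launch_parallel_worker(const WorkerCounters *c)
{
    return active_parallel_workers(c) < sketch_max_parallel_workers;
}

/*
 * Model the sequence from step 2: both counters are reset to zero,
 * then each stale worker slot is forgotten, bumping only the
 * terminate count.
 */
WorkerCounters
counters_after_crash_restart(uint32_t stale_workers)
{
    WorkerCounters c = {0, 0};

    for (uint32_t i = 0; i < stale_workers; i++)
        c.parallel_terminate_count++;
    return c;
}
```

Under these assumptions, counters_after_crash_restart(8) leaves the
register count at 0 and the terminate count at 8, so the computed
number of active workers wraps to a huge value and
can_launch_parallel_worker() returns false, which would match
nworkers_launched = 0.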

I had assumed the parallel workers were crashing because of my own
mistake, and that this behavior was therefore expected. Sorry for that.

Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
