Re: [HACKERS] strange parallel query behavior after OOM crashes

Noah Misch Sun, 09 Apr 2017 20:19:18 -0700

On Thu, Apr 06, 2017 at 03:04:13PM +0530, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 6:49 PM, Amit Kapila <[email protected]> wrote:
> > On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
> > <[email protected]> wrote:
> >> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> >>> I'm probably missing something, but I don't quite understand how these
> >>> values actually survive the crash. I mean, what I observed is OOM followed
> >>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
> >>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash
> >>> for some reason?
> >> AFAICU, during crash recovery, we wait for all non-syslogger children
> >> to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
> >> StartupDataBase. While starting the startup process we check if any
> >> bgworker is scheduled for a restart.
> >>
> >
> > In general, your theory appears right, but can you check how it
> > behaves in standby server because there is a difference in how the
> > startup process behaves during master and standby startup?  In master,
> > it stops after recovery whereas in standby it will keep on running to
> > receive WAL.
> >
> While performing StartupDataBase, both master and standby servers
> behave in a similar way until the postmaster spawns the startup process.
> On the master, the startup process completes its job and dies. As a result,
> reaper is called, which in turn calls maybe_start_bgworker().
> On a standby, after getting a valid snapshot, the startup process sends
> the postmaster a signal to enable connections. The signal handler in the
> postmaster calls maybe_start_bgworker().
> In maybe_start_bgworker(), if we find a crashed bgworker (crashed_at !=
> 0) with a NEVER RESTART flag, we call ForgetBackgroundWorker() to
> forget the bgworker process.
> 
> I've attached the patch for adding an argument in
> ForgetBackgroundWorker() to indicate a crashed situation. Based on
> that, we can take the necessary actions. I've not included the Assert
> statement in this patch.
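
For readers following along, the flow described above can be sketched in
standalone C. This is only an illustration under stated assumptions, not the
attached patch: every sketch_* name and both struct types are hypothetical, and
it assumes the "values" under discussion are counters kept in shared memory
that are zeroed when shmem is reset after a crash. The extra argument tells the
forget routine that the worker died before the reset, so it leaves the freshly
zeroed counter alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PostgreSQL's real structures. */
typedef struct SketchWorker
{
    bool in_use;         /* slot still tracked by the postmaster */
    bool never_restart;  /* models a worker with the NEVER RESTART flag */
    long crashed_at;     /* 0 means "has not crashed" */
} SketchWorker;

typedef struct SketchShmem
{
    unsigned registered; /* assumed counter: registrations since last reset */
    unsigned terminated; /* assumed counter: terminations since last reset */
} SketchShmem;

/* Models the shmem reset done during crash recovery: counters go back to 0. */
static void
sketch_shmem_init(SketchShmem *shm)
{
    shm->registered = 0;
    shm->terminated = 0;
}

/*
 * Models a forget routine with an added "after_crash" argument.  A worker
 * that crashed before the reset was never counted in the new shmem, so
 * only the non-crash path bumps the termination counter.
 */
static void
sketch_forget_worker(SketchShmem *shm, SketchWorker *w, bool after_crash)
{
    if (!after_crash)
        shm->terminated++;
    w->in_use = false;
}

/* Models the scan: crashed, never-restart workers are forgotten. */
static void
sketch_maybe_start_workers(SketchShmem *shm, SketchWorker *workers, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        SketchWorker *w = &workers[i];

        if (!w->in_use)
            continue;
        if (w->crashed_at != 0 && w->never_restart)
            sketch_forget_worker(shm, w, true);
    }
}

/* Invariant the assumed counters should satisfy at all times. */
static bool
sketch_counters_consistent(const SketchShmem *shm)
{
    return shm->terminated <= shm->registered;
}
```

In this sketch, calling sketch_shmem_init() and then
sketch_maybe_start_workers() on a crashed never-restart worker leaves both
counters at zero. Passing false to sketch_forget_worker() in the same
situation would push the termination counter past the registration counter.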


[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

