Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote:
> > On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:
> > > On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:
> > > > On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
> > > > > - We need more aggressive error checking on close(), for ENOSPC and
> > > > >   EIO. In both cases afaics we'll have to trigger a crash recovery
> > > > >   cycle. It's entirely possible to end up in a loop on NFS etc, but I
> > > > >   don't think there's a way around that.
> > > > 
> > > > If the no-space or write failures are persistent, as you mentioned
> > > > above, what is the point of going into crash recovery --- why not just
> > > > shut down?
> > > 
> > > Well, I mentioned that as an alternative in my email. But for one we
> > > don't really have cases where we do that right now, for another we can't
> > > really differentiate between a transient and non-transient state. It's
> > > entirely possible that the admin on the system that ran out of space
> > > fixes things, clearing up the problem.
> > 
> > True, but if we get a no-space error, odds are it will not be fixed at
> > the time we are failing.  Wouldn't the administrator check that the
> > server is still running after they free the space?
> 
> I'd assume it's pretty common that those are separate teams. Given that
> we currently don't behave that way for other cases where we *already*
> can enter crash-recovery loops I don't think we need to introduce that
> here. It's far more common to enter this kind of problem with pg_xlog
> filling up the ordinary way. And that can lead to such loops.

When we crash-restart, we also go through and clean things up some, no?
Seems like that gives us the potential to end up fixing things ourselves
and allowing the crash-restart to succeed.

Consider unlogged tables, temporary tables, on-disk sorts, etc.  It's
entirely common for a bad query to run the system out of disk space
(with a write to a regular table being what actually hits the
out-of-space error...) and if we crash-restart properly then we'd
hopefully clean those things out, freeing up space and allowing us to
come back up.

Now, of course, ideally admins would set up temp tablespaces and
segregate WAL onto its own filesystem, etc, but...
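
For reference, a rough sketch of the close() check from the quoted
proposal above.  This is not PostgreSQL code: the CloseAction enum and
both function names are made up for illustration, and a real patch
would presumably report through elog/ereport rather than return a
value.  It only shows the idea of treating ENOSPC and EIO from close()
as grounds for a crash-restart, while other failures stay ordinary
errors.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical outcomes, standing in for ERROR vs. PANIC-style handling. */
typedef enum
{
    CLOSE_OK,            /* close() succeeded */
    CLOSE_REPORT_ERROR,  /* report the failure, keep running */
    CLOSE_CRASH_RESTART  /* data may be lost: force crash recovery */
} CloseAction;

/* Map an errno value from a failed close() to an action. */
static CloseAction
classify_close_errno(int err)
{
    switch (err)
    {
        case 0:
            return CLOSE_OK;
        case ENOSPC:
        case EIO:
            /* Buffered writes may not have reached storage. */
            return CLOSE_CRASH_RESTART;
        default:
            return CLOSE_REPORT_ERROR;
    }
}

/* Close fd and say what the caller should do about the result. */
static CloseAction
checked_close(int fd)
{
    if (close(fd) == 0)
        return CLOSE_OK;

    /* errno is read immediately, before any other call can clobber it. */
    return classify_close_errno(errno);
}
```

Whether CLOSE_CRASH_RESTART should instead mean "shut down" is exactly
the open question in this thread; the sketch doesn't take a side on
that.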

Thanks!

Stephen

