On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <and...@anarazel.de> wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I
wasn't entirely not-serious either.  If you can't reliably find
out whether the contents of the file on disk are the same as the
contents that the kernel is giving you when you call read(), then you
are going to have a heck of a time building a reliable system.  If the
kernel developers are determined to insist on these semantics (and,
admittedly, I don't know whether that's the case - I've only read
Anthony's remarks), then I don't really see what we can do except give
up on buffered I/O (or on Linux).
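
To make that concrete, here is a minimal sketch of the hazard I mean.
It's illustration only, not a reproducer: you'd have to inject a device
failure between the two fsync() calls (e.g. with a device-mapper error
target) to actually see the second call succeed after the first one
reported EIO.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char        buf[8192] = "important data";
    char        check[8192];
    int         fd = open("datafile", O_RDWR | O_CREAT, 0600);

    if (write(fd, buf, sizeof(buf)) < 0)    /* lands in the page cache */
        perror("write");

    if (fsync(fd) != 0)         /* writeback fails: EIO is reported once... */
        perror("fsync");        /* ...and the pages may be marked clean     */

    if (fsync(fd) == 0)         /* so a retry can report success            */
        printf("second fsync: ok (data may never have reached disk)\n");

    /*
     * read()/pread() are satisfied from the page cache, so they can return
     * the new contents even when the on-disk copy is stale; there is no
     * cheap way for the application to tell those two states apart.
     */
    if (pread(fd, check, sizeof(check), 0) > 0)
        printf("read back: %.14s\n", check);

    close(fd);
    return 0;
}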

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think
it's essential.  The only argument for the current Linux behavior that
has so far been advanced on this thread, at least as far as I can see,
is that if it kept retrying the buffers forever, it would be pointless
and might run the machine out of memory, so we might as well discard
them.  But previous comments have already illustrated that the kernel
is not really up against a wall there -- it could put individual
inodes into a permanent failure state when it discards their dirty
data, as you suggested, or it could do what others have suggested, and
what I think is better, which is to put the whole filesystem into a
permanent failure state that can be cleared by remounting the FS.
That could be done on an as-needed basis -- if the number of dirty
buffers you're holding onto for some filesystem becomes too large, put
the filesystem into infinite-fail mode and discard them all.  That
behavior would be pretty easy for administrators to understand and
would resolve the entire problem here provided that no PostgreSQL
processes survived the eventual remount.
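
With semantics like that, the userspace half of the contract gets very
simple: any fsync() failure is treated as fatal, and once the
administrator remounts, crash recovery takes over.  Roughly this kind
of thing (fsync_or_die is a name I just made up, not anything that
exists in PostgreSQL):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Invented helper: crash hard on any fsync() failure. */
static void
fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "PANIC: could not fsync \"%s\": %s\n",
                path, strerror(errno));
        abort();    /* force WAL replay after the admin remounts */
    }
}

int
main(void)
{
    int         fd = open("datafile", O_RDWR | O_CREAT, 0600);

    /* ... write data here ... */
    fsync_or_die(fd, "datafile");
    close(fd);
    return 0;
}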

I also don't really know what we mean by an "irresolvable" error.  If
the drive is beyond all hope, then it doesn't really make sense to
talk about whether the database stored on it is corrupt.  In general
we can't be sure that we'll even get an error - e.g. the system could
be idle and the drive could be on fire.  Maybe this is the case you
meant by "it'd be nice if we could report it reliably".  But at least
in my experience, that's typically not what's going on.  You get some
I/O errors and so you remount the filesystem, or reboot, or rebuild
the array, or ... something.  And then the errors go away and, at that
point, you want to run recovery and continue using your database.  In
this scenario, it matters *quite a bit* what the error reporting was
like during the period when failures were occurring.  In particular,
if the database was allowed to think that it had successfully
checkpointed when it didn't, you're going to start recovery from the
wrong place.
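
To spell out the sequence I'm worried about, here's a toy version of
it; write_dirty_buffers, fsync_all_relations, and
write_checkpoint_record are made-up placeholders for the checkpointer's
real work, not actual functions:

#include <stdio.h>

/* Placeholders only, standing in for what a checkpoint actually does. */
static void write_dirty_buffers(void)     { /* write() every dirty page */ }
static int  fsync_all_relations(void)     { return 0; /* "success" */ }
static void write_checkpoint_record(void) { puts("checkpoint recorded"); }

int
main(void)
{
    write_dirty_buffers();          /* data goes to the page cache only */

    /*
     * Kernel writeback fails here; the error is reported once (or to
     * nobody at all, if no one had the file open at the time), the pages
     * are marked clean, and the error state is forgotten.
     */

    if (fsync_all_relations() == 0) /* reports success despite the loss */
        write_checkpoint_record();  /* recovery will now start after the
                                     * point where the lost data was
                                     * supposed to be durable */
    return 0;
}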

I'm going to shut up now because I'm telling you things that you
obviously already know, but this doesn't sound like a "near
irresolvable corner case".  When the storage goes bonkers, either
PostgreSQL and the kernel can interact in such a way that a checkpoint
can succeed without all of the relevant data getting persisted, or
they don't.  It sounds like right now they do, and I'm not really
clear that we have a reasonable idea how to fix that.  It does not
sound like a PANIC is sufficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
