On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <and...@anarazel.de> wrote:
> Let's lower the pitchforks a bit here. Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly. Yes, it'd be nice if
> we could report it reliably. But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think it's essential. The only argument for the current Linux behavior that has so far been advanced on this thread, at least as far as I can see, is that if it kept retrying the buffers forever, it would be pointless and might run the machine out of memory, so we might as well discard them. But previous comments have already illustrated that the kernel is not really up against a wall there -- it could put individual inodes into a permanent failure state when it discards their dirty data, as you suggested, or it could do what others have suggested, and what I think is better, which is to put the whole filesystem into a permanent failure state that can be cleared by remounting the FS. That could be done on an as-needed basis -- if the number of dirty buffers you're holding onto for some filesystem becomes too large, put the filesystem into infinite-fail mode and discard them all.
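
To make the infinite-fail idea concrete, here is a toy model in Python. It is only a sketch: ToyFilesystem, max_failed_dirty, writeback_failed() and checkpoint() are names made up for illustration and do not correspond to any real kernel or PostgreSQL interface. The point it tries to show is that once dirty data has been discarded, every later fsync keeps failing until a remount, so a checkpoint can never be reported as successful by accident.

```python
class ToyFilesystem:
    """Toy model of the proposed sticky-failure semantics (not real kernel behavior)."""

    def __init__(self, max_failed_dirty):
        self.max_failed_dirty = max_failed_dirty  # hypothetical threshold
        self.failed_dirty = 0      # buffers whose writeback failed, still held in memory
        self.failed_state = False  # sticky: only remount() clears it

    def writeback_failed(self, nbuffers):
        """Record buffers that could not be written to storage."""
        if self.failed_state:
            return  # already in infinite-fail mode; nothing more to track
        self.failed_dirty += nbuffers
        if self.failed_dirty > self.max_failed_dirty:
            # Too much unwritable data: discard it, but remember that we did.
            self.failed_dirty = 0
            self.failed_state = True

    def fsync(self):
        """Return True only if nothing is known to be missing from disk."""
        return not self.failed_state and self.failed_dirty == 0

    def remount(self):
        """Administrator action that clears the failure state."""
        self.failed_dirty = 0
        self.failed_state = False


def checkpoint(fs, redo_ptr, new_redo_ptr):
    """Advance the (hypothetical) redo pointer only if fsync reports success."""
    return new_redo_ptr if fs.fsync() else redo_ptr
```

In this model, a filesystem created with max_failed_dirty=2 that sees three failed buffers drops them and then fails every fsync(), so checkpoint() keeps returning the old redo pointer until remount() is called.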
That behavior would be pretty easy for administrators to understand and would resolve the entire problem here provided that no PostgreSQL processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error. If the drive is beyond all hope, then it doesn't really make sense to talk about whether the database stored on it is corrupt. In general we can't be sure that we'll even get an error - e.g. the system could be idle and the drive could be on fire. Maybe this is the case you meant by "it'd be nice if we could report it reliably". But at least in my experience, that's typically not what's going on. You get some I/O errors and so you remount the filesystem, or reboot, or rebuild the array, or ... something. And then the errors go away and, at that point, you want to run recovery and continue using your database. In this scenario, it matters *quite a bit* what the error reporting was like during the period when failures were occurring. In particular, if the database was allowed to think that it had successfully checkpointed when it didn't, you're going to start recovery from the wrong place.

I'm going to shut up now because I'm telling you things that you obviously already know, but this doesn't sound like a "near irresolvable corner case". When the storage goes bonkers, either PostgreSQL and the kernel can interact in such a way that a checkpoint can succeed without all of the relevant data getting persisted, or they can't. It sounds like right now they can, and I'm not really clear that we have a reasonable idea how to fix that. It does not sound like a PANIC is sufficient.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company