On 10 April 2018 at 03:59, Andres Freund <and...@anarazel.de> wrote:
> On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
>> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
>> > You could make the argument that it's OK to forget if the entire file
>> > system goes away. But actually, why is that ok?
>>
>> I was going to say that it'd be okay to clear error flag on umount, since any
>> opened files would prevent unmounting; but, then I realized we need to consider
>> the case of close()ing all FDs then opening them later..in another process.
>
>> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
>> > notification descriptor open, where the kernel would inject events
>> > related to writeback failures of files under watch (potentially
>> > enriched to contain info regarding the exact failed pages and
>> > the file offset they map to).
>>
>> For postgres that'd require backend processes to open() a file such that,
>> following its close(), any writeback errors are "signalled" to the checkpointer
>> process...
>
> I don't think that's as hard as some people argued in this thread. We
> could very well open a pipe in postmaster with the write end open in
> each subprocess, and the read end open only in checkpointer (and
> postmaster, but unused there). Whenever closing a file descriptor that
> was dirtied in the current process, send it over the pipe to the
> checkpointer. The checkpointer then can receive all those file
> descriptors (making sure it's not above the limit, fsync(), close()ing
> to make room if necessary). The biggest complication would presumably
> be to deduplicate the received file descriptors for the same file,
> without losing track of any errors.

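To make the hand-off concrete, here is a rough sketch of the idea, not PostgreSQL code. On Linux and the BSDs an open descriptor is usually passed between processes as SCM_RIGHTS ancillary data over a Unix-domain socket rather than over a plain pipe, so the sketch assumes a socket in place of the pipe. It also assumes Python 3.9+ on Unix; hand_off_fd(), receive_and_fsync() and demo() are hypothetical names, and a socketpair() inside one process stands in for the backend and the checkpointer.

```python
import os
import socket
import tempfile


def hand_off_fd(sock, fd):
    """Backend side (hypothetical): pass a dirtied fd on, then drop our copy."""
    # The one-byte payload only exists to carry the SCM_RIGHTS ancillary data.
    socket.send_fds(sock, [b"F"], [fd])
    os.close(fd)


def receive_and_fsync(sock):
    """Checkpointer side (hypothetical): receive one fd, fsync() it, close it.

    Returns (st_dev, st_ino) of the file; os.fsync() raises OSError if the
    kernel reports a failure on this descriptor.
    """
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    fd = fds[0]
    try:
        st = os.fstat(fd)
        os.fsync(fd)
        return (st.st_dev, st.st_ino)
    finally:
        os.close(fd)


def demo():
    """Single-process stand-in for the backend -> checkpointer hand-off."""
    backend_end, checkpointer_end = socket.socketpair()
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "relfile")
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
            os.write(fd, b"dirty page")
            expected = os.stat(path)
            hand_off_fd(backend_end, fd)
            got = receive_and_fsync(checkpointer_end)
    finally:
        backend_end.close()
        checkpointer_end.close()
    return got == (expected.st_dev, expected.st_ino)
```

The received descriptor refers to the same open file description the backend wrote through, which is the property the proposal relies on. Whether an error is actually reported on it still depends on the kernel's writeback error semantics discussed in this thread.
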
Yep. That'd be a cheaper way to do it, though it wouldn't work on
Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the
same time as a backend, before the backend writes to it. But passing
the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it
needed to reopen it sometimes, or just make sure to keep the oldest
copy of the fd when a backend reopens multiple times, but that's no
biggie.

We'd still have to fsync() out early in the checkpointer if we ran out
of space in our FD list, and initscripts would need to change our
ulimit or we'd have to do it ourselves in the checkpointer. But
neither seems insurmountable.

FWIW, I agree that this is a corner case, but it's getting to be a
pretty big corner with the spread of overcommitted, deduplicating
SANs, cloud storage, etc. Not all I/O errors indicate permanent
hardware faults, disk failures, etc., as I outlined earlier. I'm very
curious to know what the error semantics are for AWS EBS and other
cloud network block stores. (I posted on the Amazon forums,
https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0,
but nothing so far.)

I'm also not particularly inclined to trust that all file systems will
always reliably reserve space without having some cases where they'll
fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the
direction the storage world is moving in, and whether this will become
a bigger issue over time.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
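
As a rough illustration of the bookkeeping described above (keep the oldest copy of each file's fd, and fsync() out early when the FD list is full), here is a minimal Python sketch. It is not PostgreSQL code: PendingFsyncs and its limit parameter are hypothetical, and identifying a file by (st_dev, st_ino) is an assumption made for the example.

```python
import os


class PendingFsyncs:
    """Hypothetical checkpointer-side table of fds awaiting fsync()."""

    def __init__(self, limit):
        self.limit = limit   # max fds we allow ourselves to hold open
        self.by_file = {}    # (st_dev, st_ino) -> oldest fd for that file

    def add(self, fd):
        """Take ownership of fd; returns errors from any early flush."""
        st = os.fstat(fd)
        key = (st.st_dev, st.st_ino)
        if key in self.by_file:
            # Already tracking this file: keep the oldest fd, drop the new one.
            os.close(fd)
            return []
        errors = []
        if len(self.by_file) >= self.limit:
            # Out of room in the FD list: fsync() out early to make space.
            errors = self.flush()
        self.by_file[key] = fd
        return errors

    def flush(self):
        """fsync() and close every tracked fd; returns [(key, errno), ...]."""
        errors = []
        for key, fd in self.by_file.items():
            try:
                os.fsync(fd)
            except OSError as exc:
                # Record the failure instead of losing track of it.
                errors.append((key, exc.errno))
            finally:
                os.close(fd)
        self.by_file.clear()
        return errors
```

A caller would treat any non-empty error list as a failed checkpoint. Raising the process's own descriptor limit, the other option mentioned above, is left out of the sketch.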