On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote: > > We already have dirty_bytes and dirty_background_bytes, for example. I > don't see why there couldn't be another limit defining how much dirty > data to allow before blocking writes altogether. I'm sure it's not that > simple, but you get the general idea - do not allow using all available > memory because of writeback issues, but don't throw the data away in > case it's just a temporary issue.
Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack). > Well, there seem to be kernels that seem to do exactly that already. At > least that's how I understand what this thread says about FreeBSD and > Illumos, for example. So it's not an entirely insane design, apparently. It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted. Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction. > One has to wonder how many applications actually use this correctly, > considering PostgreSQL cares about data durability/consistency so much > and yet we've been misunderstanding how it works for 20+ years. I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see ). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner. Best regards, Anthony  https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf