Hi,

On 2018-04-28 20:00:25 +0800, Craig Ringer wrote:
> On 28 April 2018 at 06:28, Andres Freund <and...@anarazel.de> wrote:
> > The second major type of proposal was using direct-IO. That'd generally
> > be a desirable feature, but a) would require some significant changes to
> > postgres to be performant, b) isn't really applicable for the large
> > percentage of installations that aren't tuned reasonably well, because
> > at the moment the OS page cache functions as a memory-pressure aware
> > extension of postgres' page cache.
> 
> Yeah. I've avoided advocating for O_DIRECT because it's a big job
> (understatement). We'd need to pay so much more attention to details
> of storage layout if we couldn't rely as much on the kernel neatly
> organising and queuing everything for us, too.
> 
> At the risk of displaying my relative ignorance of direct I/O: Does
> O_DIRECT without O_SYNC even provide a strong guarantee that when you
> close() the file, all I/O has reliably succeeded? It must've gone
> through the FS layer, but don't FSes do various caching and
> reorganisation too? Can the same issue arise in other ways unless we
> also fsync() before close() or write O_SYNC?

No, not really. There are generally two categories of IO here: metadata IO
and data IO. The filesystem's metadata IO a) has a lot more error
checking (including things like remount-ro, stalling the filesystem on
errors, etc.), and b) isn't direct IO itself.  For some filesystem metadata
operations you'll still need fsyncs, but the *data* is flushed if you
use DIO. I'd personally use O_DSYNC | O_DIRECT, and have the metadata
operations guaranteed by fsyncs.  You'd still need the current fsyncs for
renaming, and probably some fsyncs for file extensions; the latter to
make sure the filesystem has written the metadata change.
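
To make that division of labour concrete, here's a minimal sketch (file
names, sizes and the structure are made up, this isn't actual postgres
code): the data goes through an O_DIRECT | O_DSYNC fd, while the file
extension and the rename are backed by explicit fsyncs.

/* sketch: O_DIRECT | O_DSYNC for data, fsync for metadata operations */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    void   *buf;
    int     fd = open("relfile.tmp", O_CREAT | O_WRONLY | O_DIRECT | O_DSYNC, 0600);

    /* O_DIRECT wants aligned buffers/lengths/offsets; 4096 covers most devices */
    if (fd < 0 || posix_memalign(&buf, 4096, 8192) != 0)
        return 1;
    memset(buf, 0, 8192);

    /* data write: durable on return thanks to O_DSYNC */
    if (pwrite(fd, buf, 8192, 0) != 8192)
        return 1;

    /* the write extended the file, i.e. changed metadata; back it with an fsync */
    if (fsync(fd) != 0)
        return 1;

    /* rename durability additionally needs an fsync of the containing directory */
    if (rename("relfile.tmp", "relfile") != 0)
        return 1;
    int     dfd = open(".", O_RDONLY);
    if (dfd < 0 || fsync(dfd) != 0)
        return 1;

    close(dfd);
    close(fd);
    free(buf);
    return 0;
}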


> At one point I looked into using AIO instead. But last I looked it was
> pretty spectacularly quirky when it comes to reliably flushing, and
> outright broken on some versions. In any case, our multiprocessing
> model would make tracking completions annoying, likely more so than
> the sort of FD handoff games we've discussed.

AIO pretty much only works sensibly with DIO.
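
To illustrate, a minimal libaio sketch (link with -laio; submit_one_write
is a made-up helper, and it assumes fd was opened with O_DIRECT and
buf/len/offset are suitably aligned):

#include <libaio.h>
#include <sys/types.h>

int
submit_one_write(int fd, void *buf, size_t len, long long offset)
{
    io_context_t    ctx = 0;
    struct iocb     cb;
    struct iocb    *cbs[1] = {&cb};
    struct io_event ev;

    if (io_setup(1, &ctx) < 0)
        return -1;

    io_prep_pwrite(&cb, fd, buf, len, offset);
    if (io_submit(ctx, 1, cbs) != 1)
        return -1;

    /* wait for the completion; ev.res is the result the kernel reported */
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1 || ev.res != len)
        return -1;

    io_destroy(ctx);
    return 0;
}

With buffered fds the kernel largely ends up doing the IO synchronously
inside io_submit(), which is part of why AIO only makes sense together
with DIO.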


> > Another topic brought up in this thread was the handling of ENOSPC
> > errors that aren't triggered on a filesystem level, but rather are
> > triggered by thin provisioning. On Linux that currently apparently leads
> > to page cache contents being lost (and errors "eaten") in a lot of
> > places, including just when doing a write().
> 
> ... wow.
> 
> Is that with lvm-thin?

I think both dm- and lvm-based thin provisioning (I typed llvm
thrice). The FS code basically didn't expect ENOSPC to be returned from
storage, but suddenly the storage layer started returning it...


> The thin provisioning I was mainly concerned with is SAN-based thin
> provisioning, which looks like a normal iSCSI target or a normal LUN
> on a HBA to Linux. Then it starts failing writes with a weird
> potentially vendor-specific sense error if it runs out of backing
> store. How that's handled likely depends on the specific error, the
> driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs,
> it resulted in lost writes and fsync() errors being reported once, per
> the start of the original thread.

I think the concerns are largely the same for that. You'll have to
configure the SAN to block in that case.


> > - Matthew Wilcox proposed (and posted a patch) that'd partially revert
> >   behaviour to the pre v4.13 world, by *also* reporting errors to
> >   "newer" file-descriptors if the error hasn't previously been
> >   reported. That'd still not guarantee that the error is reported
> >   (memory pressure could evict information without open fd), but in most
> >   situations we'll again get the error in the checkpointer.
> >
> >   This seems to be largely agreed upon. It's unclear whether it'll go into
> >   the stable backports for still-maintained >= v4.13 kernels.
> 
> That seems very sensible. In our case we're very unlikely to have some
> other unrelated process come in and fsync() our files for us.
> 
> I'd want to be sure the report didn't get eaten by sync() or syncfs() though.

It doesn't. Basically every fd has an errseq_t value copied into it at
open.
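
As a simplified model of the mechanism (an illustration only, not the
actual kernel code): the inode keeps an errseq_t value that is bumped
whenever writeback fails, every open() copies the current value into the
fd, and fsync() on that fd reports an error iff the value has advanced
since the copy.

#include <stdbool.h>

typedef unsigned int errseq_t;

struct mapping { errseq_t wb_err; };                      /* per inode   */
struct file    { struct mapping *m; errseq_t sampled; };  /* per open fd */

void
open_samples(struct file *f, struct mapping *m)
{
    f->m = m;
    f->sampled = m->wb_err;     /* the copy taken at open() time */
}

void
writeback_failed(struct mapping *m)
{
    m->wb_err++;                /* a lost write advances the inode's value */
}

bool
fsync_reports_error(struct file *f)
{
    if (f->sampled == f->m->wb_err)
        return false;           /* nothing new since this fd's sample */
    f->sampled = f->m->wb_err;  /* advance, so it's reported once per fd */
    return true;
}

So another process calling sync()/syncfs() can't make the error disappear
for our fd; only our own fsync() advances our sample.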


> > - syncfs() will be fixed so it reports errors properly - that'll likely
> >   require passing it an O_PATH filedescriptor to have space to store the
> >   errseq_t value that allows discerning already reported and new errors.
> >
> >   No patch has appeared yet, but the behaviour seems largely agreed
> >   upon.
> 
> Good, but as you noted, of limited use to us unless we want to force
> users to manage space for temporary and unlogged relations completely
> separately.

Well, I think it'd still be ok as a backstop if it had decent error
semantics. We don't checkpoint that often, and doing the syncing via
syncfs() is considerably more efficient than individual fsync()s. But
given that it's currently buggy, that tradeoff is moot.
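
For reference, the usage would be as trivial as this (path made up; error
reporting through this path is exactly what's currently broken):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* any fd on the target filesystem will do */
    int     fd = open("/srv/pgdata", O_RDONLY);

    if (fd < 0 || syncfs(fd) != 0)
    {
        perror("syncfs");
        return 1;
    }
    close(fd);
    return 0;
}

One syscall per filesystem instead of one fsync() per segment file is
where the efficiency win comes from.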


> I wonder if we could convince the kernel to offer a file_sync_mode
> xattr to control this? (Hint: I'm already running away in a mylar fire
> suit).

Err. I am fairly sure you're not going to get anywhere with that. Given
we're concerned about existing kernels, I doubt it'd help us much anyway.


> > - Stop inodes with unreported errors from being evicted. This will
> >   guarantee that a later fsync (without an open FD) will see the
> >   error. The memory pressure concerns here are lower than with keeping
> >   all the failed pages in memory, and it could be optimized further.
> >
> >   I read some tentative agreement behind this idea, but I think it's the
> >   by far most controversial one.
> 
> The main issue there would seem to be cases of whole-FS failure like
> the USB-key-yank example. You're going to have to be able to get rid
> of them at some point.

It's not actually a real problem (despite initially being brought up a
number of times by kernel people). There's a separate error for that
(ENODEV), and filesystems already handle it differently. Once that's
returned, fsync() etc. are just short-circuited to return ENODEV.


> > What we could do:
> >
> > - forward file descriptors from backends to checkpointer (using
> >   SCM_RIGHTS) when marking a segment dirty. That'd require some
> >   optimizations (see [4]) to avoid doing so repeatedly.  That'd
> >   guarantee correct behaviour in all linux kernels >= 4.13 (possibly
> >   backported by distributions?), and I think it'd also make it vastly
> >   more likely that errors are reported in earlier kernels.
> 
> It'd be interesting to see if other platforms that support fd passing
> will give us the desired behaviour too. But even if it only helps on
> Linux, that's a huge majority of the PostgreSQL deployments these
> days.

Afaict it'd not help on all of them. It does provide guarantees against
the inode being evicted on pretty much all OSs, but not all of them have
an error counter there...
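
The handoff itself is straightforward on anything with Unix domain
sockets; roughly like this (send_fd is a made-up helper, not what an
actual patch would look like):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int
send_fd(int sock, int fd_to_pass)
{
    struct msghdr   msg = {0};
    struct iovec    iov;
    char            dummy = 'F';
    char            cbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    iov.iov_base = &dummy;              /* at least one byte of real data */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

The hard part isn't the sendmsg() dance, it's the bookkeeping needed to
avoid re-sending the same fd over and over (see [4]).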


> > - Use direct IO. Due to architectural performance issues in PG and the
> >   fact that it'd not be applicable for all installations I don't think
> >   this is a reasonable fix for the issue presented here. Although it's
> >   independently something we should work on.  It might be worthwhile to
> >   provide a configuration that allows to force DIO to be enabled for WAL
> >   even if replication is turned on.
> 
> Seems like a long term goal, but you've noted elsewhere that doing it
> well would be hard. I suspect we'd need writer threads, we'd need to
> know more about the underlying FS/storage layout to make better
> decisions about write parallelism, etc. We get away with a lot right
> now by letting the kernel and buffered I/O sort that out.

We're a *lot* slower due to it.

I don't think you would need writer threads, "just" a bgwriter that
actually works and provides clean buffers unless the machine is
overloaded. I've posted a patch that adds that. On the write side you
then additionally need write combining (doing one write for several
on-disk-consecutive buffers), which isn't trivial to add currently.  The
bigger issue than writes is actually doing reads nicely: with DIO there's
no readahead anymore, and we'd no longer have the kernel backstopping our
bad caching decisions.
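
To make "write combining" concrete, the idea is one pwritev()-style call
per run of on-disk-consecutive buffers instead of one syscall per block; a
sketch (BLCKSZ, MAX_RUN and write_run are just placeholders here, not
postgres code):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ  8192
#define MAX_RUN 16

ssize_t
write_run(int fd, char *bufs[], int nbufs, off_t first_block)
{
    struct iovec iov[MAX_RUN];

    if (nbufs > MAX_RUN)
        nbufs = MAX_RUN;
    for (int i = 0; i < nbufs; i++)
    {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BLCKSZ;
    }
    /* one syscall covers the whole consecutive run */
    return pwritev(fd, iov, nbufs, first_block * BLCKSZ);
}

The scheduling problem (finding those runs, and having clean buffers
available at the right time) is the part that isn't trivial.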

Greetings,

Andres Freund
