On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgor...@suse.de> wrote:
> >> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> >> setting zone_reclaim_mode; is there some other problem besides that?
> >
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
> By "set" I mean "set to zero".  We've seen multiple instances of
> people complaining about large amounts of system memory going unused
> because this setting defaulted to 1.
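(For anyone following along: the setting being discussed lives in /proc, and checking or clearing it looks something like the below. The value 50MB-style thresholds elsewhere in this thread are similarly just sysctls; this is a sketch of the knob, not a tuning recommendation.)

```shell
# Check the current setting; a non-zero value means the kernel prefers
# reclaiming the local NUMA node's page cache over allocating pages
# from a remote node.
cat /proc/sys/vm/zone_reclaim_mode

# Set it to zero (requires root) so all system memory is usable
# for page cache regardless of NUMA node:
echo 0 > /proc/sys/vm/zone_reclaim_mode

# Or equivalently, and persistently via /etc/sysctl.conf:
sysctl -w vm.zone_reclaim_mode=0
```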
> >> The other thing that comes to mind is the kernel's caching behavior.
> >> We've talked a lot over the years about the difficulties of getting
> >> the kernel to write data out when we want it to and to not write data
> >> out when we don't want it to.
> >
> > Is sync_file_range() broken?
> I don't know.  I think a few of us have played with it and not been
> able to achieve a clear win.

Before you go back down the sync_file_range path, keep in mind that
it is not a guaranteed data integrity operation: it does not force
device cache flushes like fsync()/fdatasync() do. Hence it guarantees
neither that the metadata pointing at the written data nor the
volatile caches in the storage path have been flushed...

IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...

> Whether the problem is with the system
> call or the programmer is harder to determine.  I think the problem is
> in part that it's not exactly clear when we should call it.  So
> suppose we want to do a checkpoint.  What we used to do a long time
> ago is write everything, and then fsync it all, and then call it good.
>  But that produced horrible I/O storms.  So what we do now is do the
> writes over a period of time, with sleeps in between, and then fsync
> it all at the end, hoping that the kernel will write some of it before
> the fsyncs arrive so that we don't get a huge I/O spike.
> And that sorta works, and it's definitely better than doing it all at
> full speed, but it's pretty imprecise.  If the kernel doesn't write
> enough of the data out in advance, then there's still a huge I/O storm
> when we do the fsyncs and everything grinds to a halt.  If it writes
> out more data than needed in advance, it increases the total number of
> physical writes because we get less write-combining, and that hurts
> performance, too. 

Yup, the kernel defaults to maximising bulk write throughput, which
means it waits to the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centisecs to a few seconds so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low level background
IO going, and fsync should then not take very long.
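Concretely, the tweak described above looks like this (run as root; the 50MB threshold and 3 second expiry are the example values from above, to be adjusted for the workload):

```shell
# Start background writeback once ~50MB of dirty data has accumulated,
# and expire dirty inodes after ~3s (300 centisecs) so the flusher
# walks them almost immediately.
echo $((50 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
echo 300 > /proc/sys/vm/dirty_expire_centisecs

# Or equivalently via sysctl:
sysctl -w vm.dirty_background_bytes=$((50 * 1024 * 1024))
sysctl -w vm.dirty_expire_centisecs=300
```

Note that setting dirty_background_bytes zeroes its companion ratio knob, dirty_background_ratio; only one of the pair is in effect at a time.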

Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.


Dave Chinner

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)