On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgor...@suse.de> wrote:
> >> Amen to that. Actually, I think NUMA can be (mostly?) fixed by
> >> setting zone_reclaim_mode; is there some other problem besides that?
> >
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
>
> By "set" I mean "set to zero". We've seen multiple instances of
> people complaining about large amounts of system memory going unused
> because this setting defaulted to 1.
>
> >> The other thing that comes to mind is the kernel's caching behavior.
> >> We've talked a lot over the years about the difficulties of getting
> >> the kernel to write data out when we want it to and to not write data
> >> out when we don't want it to.
> >
> > Is sync_file_range() broken?
>
> I don't know. I think a few of us have played with it and not been
> able to achieve a clear win.
Before you go back down the sync_file_range path, keep in mind that it
is not a guaranteed data integrity operation: it does not force device
cache flushes the way fsync()/fdatasync() do. Hence it guarantees
neither that the metadata that points at the written data has been
flushed, nor that the volatile caches in the storage path have been
flushed... IOWs, using sync_file_range() does not avoid the need to
fsync() a file for data integrity purposes...

> Whether the problem is with the system
> call or the programmer is harder to determine. I think the problem is
> in part that it's not exactly clear when we should call it. So
> suppose we want to do a checkpoint. What we used to do a long time
> ago is write everything, and then fsync it all, and then call it good.
> But that produced horrible I/O storms. So what we do now is do the
> writes over a period of time, with sleeps in between, and then fsync
> it all at the end, hoping that the kernel will write some of it before
> the fsyncs arrive so that we don't get a huge I/O spike.
>
> And that sorta works, and it's definitely better than doing it all at
> full speed, but it's pretty imprecise. If the kernel doesn't write
> enough of the data out in advance, then there's still a huge I/O storm
> when we do the fsyncs and everything grinds to a halt. If it writes
> out more data than needed in advance, it increases the total number of
> physical writes because we get less write-combining, and that hurts
> performance, too.

Yup, the kernel defaults to maximising bulk write throughput, which
means it waits until the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background writeback
earlier so that it doesn't build up as much dirty data before you
require a flush.
There are several ways to do this by tweaking writeback knobs. The
simplest is probably just to set /proc/sys/vm/dirty_background_bytes
to an appropriate threshold (say 50MB) and dirty_expire_centisecs to a
few seconds so that background writeback starts and walks all dirty
inodes almost immediately. This will keep a steady stream of low-level
background IO going, and fsync should then not take very long.

Fundamentally, though, we need bug reports from people seeing these
problems when they see them so that we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback thresholds,
etc. really doesn't work because we can't easily determine a root
cause.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers