On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara <j...@suse.cz> wrote:
>> But that still doesn't work out very well, because now the guy who
>> does the write() has to wait for it to finish before he can do
>> anything else. That's not always what we want, because WAL gets
>> written out from our internal buffers for multiple different reasons.
>
> Well, you can always use AIO (io_submit) to submit direct IO without
> waiting for it to finish. But then you might need to track the outstanding
> IO so that you can watch with io_getevents() when it is finished.

Yeah. That wouldn't work well for us; the process that did the
io_submit() would want to move on to other things, and how would it,
or any other process, know that the I/O had completed?

> As I wrote in some other email in this thread, using IO priorities for
> data file checkpoint might be actually the right answer. They will work for
> IO submitted by fsync(). The downside is that currently IO priorities / IO
> scheduling classes work only with CFQ IO scheduler.

IMHO, the problem is simpler than that: no single process should be
allowed to completely screw over every other process on the system.
When the checkpointer process starts calling fsync(), the system
begins writing out the data that needs to be fsync()'d so aggressively
that service times for I/O requests from other processes go through the
roof. It's difficult for me to imagine that any application on any
I/O scheduler is ever happy with that behavior. We shouldn't need to
sprinkle our fsync() calls with special magic juju sauce that says
"hey, when you do this, could you try to avoid causing the rest of the
system to COMPLETELY GRIND TO A HALT?". That should be the *default*
behavior, if not the *only* behavior.

Now, that is not to say that we're unwilling to sprinkle magic juju
sauce if that's what it takes to solve this problem. If calling
fadvise() or sync_file_range() or some new API that you invent at some
point prior to calling fsync() helps the kernel do the right thing,
we're willing to do that. Or if you/the Linux community wants to
invent a new API fsync_but_do_not_crush_system() and have us call that
instead of the regular fsync(), we're willing to do that, too. But I
think there's an excellent case to be made, at least as far as
checkpoint I/O spikes are concerned, that the API is just fine as it
is and Linux's implementation is simply naive. We'd be perfectly
happy to wait longer for fsync() to complete in exchange for not
starving the rest of the system - and really, who wouldn't?
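
(For concreteness, here is roughly what the "call something before
fsync()" idea could look like from user space. This is only a sketch
in Python via ctypes, not PostgreSQL code: paced_flush(),
hint_writeback(), and the 1MB default chunk size are names and values
made up for illustration, and it assumes Linux with glibc's
sync_file_range() wrapper. Whether hinting like this actually tames
the fsync() spike is exactly the open question, so treat it as
something to experiment with, not a fix.)

```python
import ctypes
import os

# Linux value from <linux/fs.h>: start writeback of dirty pages in the
# given range, but do not wait for it to complete.
SYNC_FILE_RANGE_WRITE = 2

_libc = ctypes.CDLL(None, use_errno=True)


def hint_writeback(fd, offset, nbytes):
    """Ask the kernel to start writing a range; True if the hint was accepted."""
    fn = getattr(_libc, "sync_file_range", None)
    if fn is None:  # not Linux/glibc: the hint is simply skipped
        return False
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    return fn(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0


def paced_flush(path, data, chunk_size=1 << 20):
    """Write data in chunks, nudge writeback after each chunk, then fsync once.

    Returns the number of chunks for which the writeback hint succeeded.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    hints = 0
    try:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            written = 0
            while written < len(chunk):  # pwrite() may write less than asked
                written += os.pwrite(fd, chunk[written:], offset + written)
            if hint_writeback(fd, offset, len(chunk)):
                hints += 1
        os.fsync(fd)  # durability still comes only from this call
    finally:
        os.close(fd)
    return hints
```

If the hint is unavailable or fails, the sketch just carries on; the
final fsync() is the only call it relies on for durability, so the
hint can only change when the writes happen, not whether they happen.
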
Linux is a multi-user system, and apportioning resources among
multiple tasks is a basic function of a multi-user kernel.

</rant>

Anyway, if CFQ or any other Linux I/O scheduler gets an option to
lower the priority of the fsyncs, I'm sure somebody here will test it
out and see whether it solves this problem. AFAICT, experiments to
date have pretty much universally shown CFQ to be worse than not-CFQ
and everything else to be more or less equivalent - but if that
changes, I'm sure many PostgreSQL DBAs will be more than happy to flip
CFQ back on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers