On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> On 08/27/2014 04:20 PM, Andres Freund wrote:
> >On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> >>>I think a somewhat smarter version of the explicit flushes in the
> >>>hack^Wpatch I posted nearby is more likely to be successful.
> >>That path is "dangerous" (as in, may not work as intended) if the
> >>filesystem doesn't properly understand range flushes (ehem, like
> >The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
> >guaranteeing durability. And - afaik - not implemented in a file system
> >specific manner. It just initiates writeback for individual pages. It
> >doesn't cause barriers, journal flushes or anything to be issued. That's
> >still done by the fsync() later.
> >The big disadvantage is that it's an OS specific solution, but I don't
> >think we're going to find anything that isn't in this area.
> I've been thinking for a long time that we should interleave the writes and
> the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
> causing a spike, but at least not more than that.
I think there are considerable benefits to sorting checkpoint io by file
and offset in that file. Obviously the likelihood of sequential IO is
higher; but there's also less chance that other processes write out
dirty buffers that have to be flushed out by the fsync() in a drive by
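To make the sorting argument concrete, here is a minimal sketch in Python (not PostgreSQL code). `DirtyBuffer` and both helper functions are hypothetical names made up for the illustration. It only shows that ordering by file and block turns scattered writes into a few sequential runs.

```python
from typing import NamedTuple


class DirtyBuffer(NamedTuple):
    # Hypothetical stand-in for a buffer tag: which file, which block in it.
    filenode: int
    block: int


def checkpoint_order(buffers):
    """Return the dirty buffers sorted by file, then by block within the file."""
    return sorted(buffers, key=lambda b: (b.filenode, b.block))


def sequential_runs(ordered):
    """Count runs of consecutive blocks of the same file in the given order."""
    runs = 0
    prev = None
    for buf in ordered:
        starts_new_run = (
            prev is None
            or buf.filenode != prev.filenode
            or buf.block != prev.block + 1
        )
        if starts_new_run:
            runs += 1
        prev = buf
    return runs


dirty = [DirtyBuffer(2, 7), DirtyBuffer(1, 3), DirtyBuffer(2, 5),
         DirtyBuffer(1, 2), DirtyBuffer(2, 6), DirtyBuffer(1, 4)]
ordered = checkpoint_order(dirty)
# Six separate writes in buffer-pool order collapse into two sequential runs.
print(sequential_runs(dirty), "runs unsorted,", sequential_runs(ordered), "runs sorted")
```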
I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is a heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write back controller with more
than 1GB of cache.
So I think we need both, control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buftable.c into a radix tree. That'd allow us to
efficiently scan all buffers of a filenode in order. Useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst case memory requirements and
the depth of the tree quite bad.
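A rough back-of-the-envelope for the depth problem, as a Python sketch. The assumptions: a buffer tag of 20 bytes (three 4-byte fields for the relfilenode, a 4-byte fork number, a 4-byte block number, which is my recollection of the struct layout rather than something stated here), and a plain radix tree without path compression that consumes a fixed number of key bits per level. `worst_case_depth` is a made-up helper.

```python
import math

# Assumed layout of a buffer tag, in bytes per field (an assumption, see above).
TAG_FIELD_BYTES = {
    "spcNode": 4,
    "dbNode": 4,
    "relNode": 4,
    "forkNum": 4,
    "blockNum": 4,
}


def worst_case_depth(key_bits, bits_per_level):
    """Levels needed when every level consumes bits_per_level bits of the key."""
    return math.ceil(key_bits / bits_per_level)


key_bits = 8 * sum(TAG_FIELD_BYTES.values())
for bits_per_level in (4, 8, 16):
    print(f"{key_bits}-bit key, {bits_per_level} bits/level:",
          worst_case_depth(key_bits, bits_per_level), "levels")
```

For comparison, a 64-bit key at 8 bits per level needs 8 levels under the same assumptions.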
> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
> them out in order.
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.
It's a major problem imo.
What I'd been thinking of is to do the checkpoint writeout in batches. Collect
100k buffers, sort them, write them out. Go to the next 100k.
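Roughly what I mean, as a Python sketch rather than anything resembling the real checkpointer. `sorted_batches` is a hypothetical helper and buffers are represented as plain `(filenode, block)` tuples for the illustration. The point is that memory for sorting is bounded by the batch size, not by the size of shared_buffers.

```python
from itertools import islice


def sorted_batches(dirty_buffers, batch_size=100_000):
    """Yield lists of at most batch_size buffers, each sorted by (file, block).

    Only one batch is held in memory at a time, so the sort never needs
    more than batch_size entries.
    """
    it = iter(dirty_buffers)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        batch.sort()  # tuples compare by filenode first, then by block
        yield batch


# Tiny batch size so the effect is visible.
dirty = [(2, 1), (1, 9), (1, 2), (1, 1), (3, 0)]
for batch in sorted_batches(dirty, batch_size=3):
    print(batch)
```

The tradeoff is that ordering only holds within a batch, so the same file can be visited again by a later batch.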
> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.
I bet it's not fundamentally changing the number of transactions that
don't make the deadline - there are more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superfluous random IO, which your patch should alleviate.
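For readers following along, the interleaving idea (write everything for one file, fsync it, move to the next) can be sketched like this in Python. This is not the attached patch. `write_and_sync_per_file`, the file names and the 8192-byte block size are assumptions for the illustration, and it doesn't do any early writeback initiation.

```python
import os
import tempfile
from itertools import groupby

BLOCK_SIZE = 8192  # assumption: the default PostgreSQL block size


def write_and_sync_per_file(directory, dirty_pages):
    """Write pages one file at a time, fsyncing each file before the next.

    dirty_pages is an iterable of (filename, block_no, data) tuples.
    Returns the file names in the order they were synced.
    """
    synced = []
    ordered = sorted(dirty_pages, key=lambda p: (p[0], p[1]))
    for name, pages in groupby(ordered, key=lambda p: p[0]):
        path = os.path.join(directory, name)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            for _, block_no, data in pages:
                os.pwrite(fd, data, block_no * BLOCK_SIZE)
            # At most one file's worth of dirty data is outstanding here.
            os.fsync(fd)
        finally:
            os.close(fd)
        synced.append(name)
    return synced


with tempfile.TemporaryDirectory() as tmp:
    pages = [("seg_b", 1, b"y" * BLOCK_SIZE),
             ("seg_a", 0, b"x" * BLOCK_SIZE),
             ("seg_a", 2, b"z" * BLOCK_SIZE)]
    print(write_and_sync_per_file(tmp, pages))
```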
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Sent via pgsql-hackers mailing list (email@example.com)