On 2016-01-07 16:05:32 +0100, Fabien COELHO wrote:
> >But I'm inclined to go a different way: I think it's a mistake to do
> >flushing based on a single file. It seems better to track a fixed number of
> >outstanding 'block flushes', independent of the file. Whenever the number
> >of outstanding blocks is exceeded, sort that list, and flush all
> >outstanding flush requests after merging neighbouring flushes.
> Hmmm. I'm not sure I understand your strategy.
> I do not think that flushing without a prior sorting would be effective,
> because there is no clear reason why buffers written together would then be
> next to each other and thus give sequential write benefits; we would just get
> flushed random IO. I tested that and it worked badly.

Oh, I was thinking of sorting & merging these outstanding flushes. Sorry
for not making that clear.
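
Roughly what I have in mind, as a minimal sketch rather than actual
PostgreSQL code. All names here (FlushQueue, merge_pending, MAX_PENDING)
and the limit of 32 are made up for illustration, and the "issued" list
merely stands in for the real flush calls:

```python
from typing import List, Tuple

# Hypothetical sketch; none of these names exist in PostgreSQL.
MAX_PENDING = 32  # assumed cap on outstanding block flushes

Block = Tuple[int, int]        # (file id, block number)
Range = Tuple[int, int, int]   # (file id, first block, block count)


def merge_pending(pending: List[Block]) -> List[Range]:
    """Sort pending blocks and merge neighbouring blocks into ranges."""
    ranges: List[Range] = []
    # Sorting by (file id, block) puts neighbours next to each other;
    # set() drops blocks that were scheduled more than once.
    for file_id, block in sorted(set(pending)):
        if ranges:
            last_file, start, count = ranges[-1]
            if last_file == file_id and start + count == block:
                # Block directly follows the previous range: extend it.
                ranges[-1] = (last_file, start, count + 1)
                continue
        ranges.append((file_id, block, 1))
    return ranges


class FlushQueue:
    """Tracks outstanding block flushes independently of the file."""

    def __init__(self, limit: int = MAX_PENDING) -> None:
        self.limit = limit
        self.pending: List[Block] = []
        self.issued: List[Range] = []  # stand-in for real flush requests

    def schedule(self, file_id: int, block: int) -> None:
        self.pending.append((file_id, block))
        # Once the list is full, sort, merge, and issue everything.
        if len(self.pending) >= self.limit:
            self.flush()

    def flush(self) -> None:
        self.issued.extend(merge_pending(self.pending))
        self.pending.clear()
```

With a limit of 4, scheduling blocks (2,7), (1,11), (1,10), (2,8) in that
order ends up as two range flushes, one per file, regardless of the order
in which the writes arrived.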

> One of the points of aggregating flushes is that the range flush call cost
> is significant, as shown by preliminary tests I did, probably up in the
> thread, so it makes sense to limit this cost, hence the aggregation. This
> removed some performance regressions I had in some cases.

FWIW, my tests show that flushing for clean ranges is pretty cheap.

> Also, the granularity of the buffer flush call is a file + offset + size, so
> necessarily it should be done this way (i.e. per file).

What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.
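
To illustrate, a sketch of translating a run tracked at block-number
level into per-file (offset, size) requests only at the point where the
calls are issued. The function name is hypothetical; the constants are
the default build values (8kB blocks, 1GB segment files), which is an
assumption about the installation:

```python
from typing import List, Tuple

BLOCK_SIZE = 8192        # default block size in bytes (assumed)
SEGMENT_BLOCKS = 131072  # blocks per 1GB segment file (assumed default)


def to_file_ranges(first_block: int, count: int) -> List[Tuple[int, int, int]]:
    """Split a run of blocks into (segment number, byte offset, byte length).

    A run that crosses a segment boundary becomes one request per segment
    file, since each flush call addresses a single file.
    """
    out: List[Tuple[int, int, int]] = []
    block = first_block
    remaining = count
    while remaining > 0:
        segment = block // SEGMENT_BLOCKS
        block_in_segment = block % SEGMENT_BLOCKS
        # Take as many blocks as still fit into this segment file.
        n = min(remaining, SEGMENT_BLOCKS - block_in_segment)
        out.append((segment, block_in_segment * BLOCK_SIZE, n * BLOCK_SIZE))
        block += n
        remaining -= n
    return out
```

The tracking side never needs to know about files or offsets; only this
last step does.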

> Once buffers are sorted per file and offset within file, then written
> buffers are as close as possible one after the other, the merging is very
> easy to compute (it is done on the fly, no need to keep the list of buffers
> for instance), it is optimally effective, and when the checkpointed file
> changes then we will never go back to it before the next checkpoint, so
> there is no reason not to flush right then.

Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces that have about the same number of dirty buffers.

> So basically I do not see a clear positive advantage to your suggestion,
> especially when taking into consideration the scheduling process of the
> scheduler:

I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.

> >Imo that means that we'd better track writes on a relfilenode + block
> >number level.
> I do not think that it is a better option. Moreover, the current approach
> has been proven to be very effective on hundreds of runs, so redoing it
> differently for the sake of it does not look like good resource allocation.

For a subset of workloads, yes.


Andres Freund
