Hello Andres,

there would be as much as was written since the last sleep, about 100
ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot.

I think that "being behind" is not a problem as such; it is really how the scheduler has been designed and works, keeping pace with WAL and time progress through small bursts of writes. If you reduce the sleep time a lot you end up with writes interleaved with small sleeps, but that would be bad for performance, as the OS would lose the ability to write much data sequentially to the disk.

It does not mean that the default 100 ms is a good figure, but the "being behind" is a feature, not an issue as such.

And even so, it's pretty easy to get into checkpoint scenarios with ~500 mbyte/s as a writeout rate.

Hmmmm. Not with my hardware:-)

Only issuing a sync_file_range() 10 times for that is obviously problematic.

Hmmm. Then it should depend on the expected write capacity of the underlying disks...

The implementation will pretty much always go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that is
the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

On my simple test hardware the flushes are not large, I think, so the problem does not arise. Maybe I should check.

My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small.


Because the point of sorting is to generate sequential writes so that the
HDD has a lot of aligned stuff to write without moving the head, and 32 is
rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue.


You can have merging between IOs from either side. But more importantly you can't merge that many requests together anyway.


The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so
I'd rather have 128? Also, it can be changed, so maybe it should really be a

I couldn't see any benefits above (and below) 32 on a 20 drive system,

So that is one kind of (big) hardware. Assuming that pages are contiguous, how much is written on each disk depends on the RAID type and the stripe size, and when it is actually written depends on the various caches (in the RAID HW card if any, on the disk, ...), so whether 32 at the OS level is the right size is pretty unclear to me. I would have said the larger the better, but indeed you should avoid blocking.

so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

Sure. Now in my tests, with my (old & small) hardware, it seemed quite smooth. What I'm driving at is that what is good may be relative and depend on the underlying hardware, which makes the right parameter not obvious to choose.

If you flush much more sync_file_range starts to do work in the

Argh, too bad. I would have hoped that the OS would just deal with it in
an asynchronous way.

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

Hmmm. What about choosing "request queue size * 0.5", then ?

Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes); msync(p, bytes); munmap(p);
instead of sync_file_range().

I think that I do not really understand how it may work, but possibly it could.

ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?

You have not yet posted the updated version of the patch.

The complexity of the round-robin scan on the array is O(1) and very few instructions, plus some stop condition which is mostly true, I think, if the writes are balanced between tablespaces, and there is no dynamic allocation in the data structure (it is an array). The binary heap is O(log(n)); there are probably dynamic allocations and frees when extracting/inserting something, there are function calls to rebalance the tree, and so on. Ok, "n" is expected to be small.

So basically, for me it is not obviously superior to the previous version. Now I'm also tired, so if it works reasonably I'll be fine with it.

[... code extract ...]

I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep

It sounds reasonable, but what would be the criterion?

What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
  (computed via max_wal_size these days). I.e. is the percentage of
  used up WAL bigger than the percentage of written out buffers.

2) Are we on schedule based on checkpoint_timeout. I.e. is the
  percentage of checkpoint_timeout already passed bigger than the
  percentage of buffers written out.

So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.

See discussion above. ISTM that the "bursts" are a useful feature of the checkpoint scheduler, especially with sorted buffers & flushes. You want to issue grouped writes that will be easily written to disk together. You do not want page writes issued one by one, interleaved with small sleeps.

It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.

The scheduler stops writing as soon as it has overtaken the progress, so it should be a very small time, but if you do that you would end up writing pages one by one, which is not desirable at all.

I'm running benchmarks right now, they'll take a bit to run to


I'm looking forward to having a look at the updated version of the patch.


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)