Hello Andres,

In some previous version I think a warning was shown if the feature was
requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do not behave as expected that is where I'll be looking. I do not think that it should error out.

The sgml documentation about the "*_flush_after" configuration parameters talks about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain
that how much data it actually amounts to depends on compilation
settings, and such.

So I understand that it works with kB as well. Now I do not think that it would need a lot of explanation to say that it is a number of pages, and I think that a number of pages is meaningful because it is, eventually, the number of IO requests to be coalesced.
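
For instance (my own illustration, assuming the default 8kB block size,
if I read the GUC unit handling right), these two settings would mean
the same thing:

    checkpoint_flush_after = 256kB   # given as a size, converted to pages
    checkpoint_flush_after = 32      # given directly as a number of pages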

In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, [...]

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive impact on SSD in your tests?

However it does not address the point that bgwriter and backends basically issue random writes, [...]

The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.

So I understand that there is a positive impact under some load. Good!

Maybe the merging strategy could be more aggressive than just strict
neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls.

Ok. Maybe the neighbour definition could be relaxed just a little bit, so that small holes are jumped over but not large ones? If there are only a few pages in between, even if they were written by another process, then writing them all together should be better? Well, this can wait for a clear case, because hopefully the OS will coalesce them again behind the scenes anyway.
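
To make the idea concrete, here is a minimal sketch of the relaxed test
I have in mind (my own illustration, not the patch's code; the gap limit
and the function name are made up):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;    /* as in PostgreSQL's block.h */

    /* Hypothetical tunable: the largest hole we would jump over. */
    #define MAX_COALESCE_GAP 2

    /*
     * Strict neighbour merging accepts only next_block == range_end + 1;
     * this relaxed rule also accepts a small hole, even if the pages in
     * between were dirtied by another process.
     */
    static bool
    can_extend_range(BlockNumber range_end, BlockNumber next_block)
    {
        return next_block > range_end &&
               (next_block - range_end) <= MAX_COALESCE_GAP + 1;
    }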

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.

I missed it; I expected something in the struct definition. Do not duplicate it, but cross-reference it?
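
Something like the following is what I was expecting at the definition
(just a sketch; the field names are simplified and the pending-request
array is elided):

    typedef struct WritebackContext
    {
        /*
         * Pointer to the GUC that caps how many pending writebacks are
         * accumulated before they are issued.  A pointer is kept, rather
         * than a copy of the value, so that a later change of the GUC
         * takes effect without reinitialising the context; see
         * WritebackContextInit().
         */
        int        *max_pending;

        int         nr_pending;      /* writebacks queued so far */

        /* ... array of pending writeback requests elided ... */
    } WritebackContext;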

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint
writes. There are absolutely no ordering guarantees there.

Yep, but there is not much benefit to expect from a few dozen random pages either.

[...] I do think that this whole writeback logic really does make sense *per table space*,

Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. sync_file_range calls on distinct devices can be issued more or less independently, and should not interfere with one another.

If you use just one context, the more tablespaces there are the smaller the performance gains, because there is less and less aggregation, hence fewer sequential writes per device.

So for me there should really be one context per tablespace. That suggests a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think it is what is needed.
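
As a rough sketch of what I mean (my own illustration, not a patch:
apart from hash_create/hash_search and WritebackContextInit, all names
are hypothetical, and I am assuming buf_internals.h is where the patch
declares the writeback stuff):

    #include "postgres.h"
    #include "storage/buf_internals.h"   /* WritebackContext & friends */
    #include "utils/hsearch.h"

    /* One writeback context per tablespace, keyed by the tablespace OID. */
    typedef struct TablespaceWbEntry
    {
        Oid              tsoid;        /* hash key */
        WritebackContext wb_context;   /* per-tablespace flush state */
    } TablespaceWbEntry;

    static HTAB *tablespace_wb_hash = NULL;

    static WritebackContext *
    get_tablespace_wb_context(Oid tsoid, int *flush_after_guc)
    {
        TablespaceWbEntry *entry;
        bool               found;

        if (tablespace_wb_hash == NULL)
        {
            HASHCTL ctl;

            MemSet(&ctl, 0, sizeof(ctl));
            ctl.keysize = sizeof(Oid);
            ctl.entrysize = sizeof(TablespaceWbEntry);
            tablespace_wb_hash = hash_create("tablespace writeback contexts",
                                             8, &ctl, HASH_ELEM | HASH_BLOBS);
        }

        entry = (TablespaceWbEntry *)
            hash_search(tablespace_wb_hash, &tsoid, HASH_ENTER, &found);
        if (!found)
            WritebackContextInit(&entry->wb_context, flush_after_guc);

        return &entry->wb_context;
    }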

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You chose not to take advantage
of that behavior. Why?

Several reasons: most importantly, there's absolutely no guarantee that you'll ever end up sleeping; it's quite common for that to happen only seldom.

Well, that would be a situation in which pg is completely unresponsive. Moreover, this behavior *makes* pg unresponsive.

If you're bottlenecked on IO, you can end up being behind all the time.

Hopefully sorting & flushing should improve this situation a lot.

But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice.  It'd also make the writeback infrastructure less generic.

Sure. It would be sufficient to have a call to ask for writebacks independently of the number of writebacks accumulated in the queue; it would not need to change the infrastructure.

Also, I think that such a call would make sense at the end of the checkpoint.
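
Something along these lines is all I have in mind (hypothetical wrapper
name, reusing the entry points already mentioned above; I'm assuming the
pending counter is readable from the context):

    /*
     * Push out whatever writeback requests are already queued, without
     * waiting for the usual threshold to be reached.  The checkpointer
     * could call this from its sleep loop, and once at the end of a
     * checkpoint.
     */
    static void
    ForcePendingWritebacks(WritebackContext *context)
    {
        if (context->nr_pending > 0)
            IssuePendingWritebacks(context);
    }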

I also don't really believe it helps that much, although that's a complex argument to make.

Yep. My thinking is that doing things in the sleeping interval does not interfere with the checkpointer scheduling, so it is less likely to go wrong and fall behind.

--
Fabien.

