Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> >* Currently *_flush_after can be set to a nonzero value, even if there's
> > no support for flushing on that platform. Imo that's ok, but perhaps
> > other people's opinion differ.
> 
> In some previous version I think a warning was shown if the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
> 
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid
mistake.


> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything looks kind of
> fuzzy and raises doubts for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.


> The sgml documentation about "*_flush_after" configuration parameter talks
> about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain
that how much data it actually amounts to depends on compile-time
settings and such.
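
To illustrate (a sketch of the conversion only, not the actual GUC code,
and assuming the default 8 kB block size): a value given in bytes simply
ends up as a block count internally.

    /* Sketch, not the actual GUC machinery: with BLCKSZ = 8 kB a setting
     * of "256kB" becomes 32 buffers internally; the 128-buffer maximum
     * mentioned below (if that is indeed the limit) would be 1 MB. */
    #define BLCKSZ 8192

    static int
    flush_after_bytes_to_buffers(long bytes)
    {
        return (int) (bytes / BLCKSZ);      /* 262144 / 8192 = 32 */
    }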

> Also, the maximum value (128 ?) should appear in the text.

Right.


> In the discussion in the WAL section, I'm not sure about the effect of
> setting writebacks on SSDs, but I think that you have made some tests,
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


> A good point of the whole approach is that it is available to all kinds
> of pg processes.

Exactly.


> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means making some
> compromises in the LRU/LFU logic...

The benefit is primarily that you don't accumulate large amounts of dirty
data in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
latency differences for workloads that are bigger than shared buffers -
because suddenly bgwriter / backends do the majority of the writes.
Flushing in the checkpoint quite possibly makes hardly any difference
in such cases.
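
For reference, the primitive involved here on Linux is sync_file_range()
with SYNC_FILE_RANGE_WRITE, which only initiates writeback of a range and
doesn't wait for completion the way fsync() would. A minimal sketch (error
handling omitted, not the actual pg code):

    #define _GNU_SOURCE
    #include <fcntl.h>          /* sync_file_range() */

    static void
    hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        /* SYNC_FILE_RANGE_WRITE starts writeback of the range but,
         * unlike fsync(), does not block until the data is on disk. */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }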


> Well, all this is best kept for later, and I'm fine with having the
> flushing logic there. I'm wondering why you chose 16 & 64 as defaults
> for backends & bgwriter, though.

I chose a small value for backends because there are often a large
number of backends, and thus the dirty data from each of them adds up. I
used a larger value for bgwriter because I saw it end up issuing
bigger IOs.
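
For scale, assuming the default 8 kB block size, that works out to:

    16 buffers * 8 kB = 128 kB of outstanding dirty data per backend
    64 buffers * 8 kB = 512 kB for the bgwriter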


> IssuePendingWritebacks: you merge only strictly neighboring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.
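
To make "strictly neighbouring" concrete, here's a simplified sketch of
the coalescing idea (not the actual IssuePendingWritebacks code; it
assumes a sorted, de-duplicated list of block numbers within one file):

    #define _GNU_SOURCE
    #include <fcntl.h>          /* sync_file_range() */

    #define BLCKSZ 8192         /* default block size assumed */

    static void
    flush_sorted_blocks(int fd, const unsigned int *blocks, int nblocks)
    {
        int         i = 0;

        while (i < nblocks)
        {
            unsigned int start = blocks[i];
            unsigned int end = start;

            /* extend the run only over strictly neighbouring blocks */
            while (i + 1 < nblocks && blocks[i + 1] == end + 1)
                end = blocks[++i];

            /* one writeback request for the whole contiguous run */
            (void) sync_file_range(fd,
                                   (off_t) start * BLCKSZ,
                                   (off_t) (end - start + 1) * BLCKSZ,
                                   SYNC_FILE_RANGE_WRITE);
            i++;
        }
    }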


> mdwriteback: all variables could be declared within the while loop; I do
> not understand why some are in and some are out.

Right.


> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO
works as well?
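
For context, the translation is just the usual segment arithmetic; a
simplified sketch (not the md.c code), assuming the default 8 kB block
size and 1 GB segment files:

    #include <sys/types.h>      /* off_t */

    #define BLCKSZ      8192
    #define RELSEG_SIZE 131072  /* blocks per 1 GB segment */

    static void
    block_to_segment(unsigned int blockno,
                     unsigned int *segno,   /* which segment file (.1, .2, ...) */
                     off_t *offset)         /* byte offset within that segment */
    {
        *segno = blockno / RELSEG_SIZE;
        *offset = (off_t) (blockno % RELSEG_SIZE) * BLCKSZ;
    }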


> struct WritebackContext: keeping a pointer to GUC variables is a kind of
> trick, I think it deserves a comment.

It has one; it's just in WritebackContextInit(). I can duplicate it.
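
To spell out the trick for readers of the thread (a sketch of the idea
only, not the exact struct from the patch):

    /* The context stores a pointer to the relevant *_flush_after GUC
     * rather than a copy of its value, so changes to the setting take
     * effect without re-initializing the context. */
    typedef struct WritebackContextSketch
    {
        int        *max_pending;    /* points at e.g. checkpoint_flush_after */
        int         nr_pending;     /* writebacks queued so far */
        /* ... followed by an array of pending buffer tags ... */
    } WritebackContextSketch;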


> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.



> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than just checkpoint
writes. There are absolutely no ordering guarantees there.
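
Concretely that means re-sorting the pending requests by target file and
block number before merging; roughly like this (a sketch with a
hypothetical struct - the real code compares full buffer tags):

    #include <stdlib.h>         /* qsort() */

    struct pending_wb
    {
        int             file_id;    /* hypothetical file identifier */
        unsigned int    blockno;
    };

    static int
    pending_wb_cmp(const void *a, const void *b)
    {
        const struct pending_wb *pa = a;
        const struct pending_wb *pb = b;

        if (pa->file_id != pb->file_id)
            return (pa->file_id < pb->file_id) ? -1 : 1;
        if (pa->blockno != pb->blockno)
            return (pa->blockno < pb->blockno) ? -1 : 1;
        return 0;
    }

    /* usage: qsort(pending, npending, sizeof(*pending), pending_wb_cmp); */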


> AFAICR I used a "flush context" for each tablespace in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per tablespace*, which suggests that there should be as
> many writeback contexts as tablespaces, otherwise the positive effect
> may be totally lost if tablespaces are used. Any thoughts?

It leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range calls
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


> SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
> I guess it is okay because it is still pinned.

Don't do things while holding a lock that don't require said lock. A pin
prevents a buffer from changing its identity.
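
As a self-contained analogue of that invariant (a pthread-based sketch,
not PostgreSQL code): the identity field is only ever changed while the
lock is held and no references exist, so a holder of a reference may copy
it after dropping the lock.

    #include <pthread.h>

    struct buf
    {
        pthread_mutex_t lock;
        int             refcount;   /* the "pin" */
        int             identity;   /* stands in for the buffer tag */
    };

    static int
    read_identity(struct buf *b)
    {
        int         id;

        pthread_mutex_lock(&b->lock);
        b->refcount++;              /* pin: identity can no longer change */
        /* ... do only the work that actually needs the lock ... */
        pthread_mutex_unlock(&b->lock);

        id = b->identity;           /* safe to copy: still pinned */

        pthread_mutex_lock(&b->lock);
        b->refcount--;              /* unpin */
        pthread_mutex_unlock(&b->lock);

        return id;
    }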


> For the checkpointer, a key aspect is that the scheduling process goes
> to sleep from time to time, and this sleep time looked like a great
> opportunity to do this kind of flushing. You chose not to take advantage
> of that behavior; why?

Several reasons: Most importantly, there's absolutely no guarantee that
you'll ever end up sleeping; it's quite common for that to happen only
seldom. If you're bottlenecked on IO, you can end up being behind all
the time. But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice.  It'd also make the writeback infrastructure less generic.  And
I don't really believe it helps that much anyway, although that's a
complex argument to make.


Thanks for the review!

Andres

