Hi,

On 2015-11-12 15:31:41 +0100, Fabien COELHO wrote:
> >A lot of details have changed, I unfortunately don't remember them all.
> >But there are more important things than the details of the patch.
> >
> >I've played *a lot* with this patch. I found a bunch of issues:
> >
> >1) The FileFlushContext infrastructure isn't actually
> >  correct. There are two problems: First, using the actual 'fd' number to
> >  reference a to-be-flushed file isn't meaningful. If there are lots
> >  of files open, fds get reused within fd.c.
> 
> Hmm.
> 
> My assumption is that a file being used (i.e. with modified pages, being used
> for writes...) would not be closed before everything is cleared...

That's likely, but far from guaranteed.

> After some poking in the code, I think that this issue may indeed be there,
> although the probability of hitting it is close to 0, but alas not 0:-)

I did hit it...

> To fix it, ISTM that it is enough to hold a "do not close lock" on the file
> while a flush is in progress (a short time) that would prevent mdclose from
> doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct? If so, how would you deal with FileClose()
being called?
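
Something along these lines is what I imagine you mean (purely a sketch
with made-up names, not the actual fd.c Vfd struct), with FileClose()
deferring the real close(2) while a flush is still queued:

/* Hypothetical sketch, not the real fd.c code: a per-vfd flag that keeps
 * the underlying kernel fd from being closed (and the fd number from
 * being reused) while a flush referencing it is still pending.
 */
#include <stdbool.h>
#include <unistd.h>

typedef struct SketchVfd
{
    int     fd;                 /* kernel file descriptor, -1 if closed */
    bool    flush_in_progress;  /* set while a flush context holds it */
    bool    close_pending;      /* a close arrived during the flush */
} SketchVfd;

static void
sketch_begin_flush(SketchVfd *vfd)
{
    vfd->flush_in_progress = true;
}

static void
sketch_end_flush(SketchVfd *vfd)
{
    vfd->flush_in_progress = false;
    if (vfd->close_pending && vfd->fd >= 0)
    {
        close(vfd->fd);         /* deferred close, now safe */
        vfd->fd = -1;
        vfd->close_pending = false;
    }
}

static void
sketch_file_close(SketchVfd *vfd)
{
    if (vfd->flush_in_progress)
        vfd->close_pending = true;  /* defer until the flush completes */
    else if (vfd->fd >= 0)
    {
        close(vfd->fd);
        vfd->fd = -1;
    }
}

That still leaves the question of what fd.c should do when it runs out of
file descriptors and wants to close least-recently-used files while
flushes on them are pending.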


> >3) I found that latency wasn't improved much for workloads that are
> >  significantly bigger than shared buffers. The problem here is that
> >  neither bgwriter nor the backends have, so far, done
> >  sync_file_range() calls. That meant that the old problem of having
> >  gigabytes of dirty data that periodically get flushed out, still
> >  exists. Having these do flushes mostly attacks that problem.
> 
> I'm conscious that the patch only addresses *checkpointer* writes, not those
> from bgwriter or backends. I agree that these will need to be addressed at
> some point as well, but given the time it takes to get a patch through, and
> that the more complex the patch the slower that is (sorting proposals are 10
> years old), I think this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over the
checkpointer's design as well.
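
To be concrete about what such flushing would roughly involve - the
sketch below is Linux-only, with made-up names and a made-up threshold,
not code from the patch: accumulate contiguous writes and hand them to
sync_file_range() once enough has piled up, so the kernel never sits on
gigabytes of dirty data.

/* Sketch only: the kind of flushing backends/bgwriter would have to do. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

#define SKETCH_BLCKSZ       8192
#define SKETCH_FLUSH_AFTER  64          /* pages; would be a GUC */

typedef struct SketchWriteback
{
    int    fd;
    off_t  start;       /* first byte of the pending range */
    off_t  nbytes;      /* length of the pending range */
} SketchWriteback;

static void
sketch_flush(SketchWriteback *wb)
{
    if (wb->nbytes > 0)
    {
        /* ask the kernel to start writeback of just this range, async */
        (void) sync_file_range(wb->fd, wb->start, wb->nbytes,
                               SYNC_FILE_RANGE_WRITE);
        wb->nbytes = 0;
    }
}

static void
sketch_remember_write(SketchWriteback *wb, int fd, off_t offset)
{
    if (wb->nbytes > 0 &&
        (fd != wb->fd || offset != wb->start + wb->nbytes))
        sketch_flush(wb);               /* not contiguous, flush old range */

    if (wb->nbytes == 0)
    {
        wb->fd = fd;
        wb->start = offset;
    }
    wb->nbytes += SKETCH_BLCKSZ;

    if (wb->nbytes >= SKETCH_FLUSH_AFTER * SKETCH_BLCKSZ)
        sketch_flush(wb);
}

Whether issuing those calls from backends has acceptable overhead is
exactly the kind of thing a PoC would have to show.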

> >What I did not expect, and what confounded me for a long while, is that
> >for workloads where the hot data set does *NOT* fit into shared buffers,
> >sorting often led to a noticeable reduction in throughput. Up to
> >30%.
> 
> I did not see such behavior in the many tests I ran. Could you share more
> precise details so that I can try to reproduce this performance regression?
> (available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise kicks in badly. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from
occurring.

I'll show actual numbers at some point, yes. I tried three different systems:

* my laptop: 16 GB RAM, 840 EVO 1TB as storage, 2GB shared_buffers. Tried
  checkpoint timeouts from 60 to 300s.  I could see issues in workloads
  ranging from scale 300 to 5000. Throughput regressions were visible for
  both sync_commit on/off workloads. This machine showed the largest
  regressions.

* my workstation: 24GB RAM, 2x E5520, a) RAID 10 of 4 4TB 7.2krpm
  devices, b) RAID 1 of 2 m4 512GB SSDs. One of the latter was killed
  during the test.  Both setups showed regressions, but smaller ones.

* EC2 d2.8xlarge, 244 GB RAM, 24 x 2000 GB HDDs, 64GB shared_buffers. I
  tried scales 3000, 8000, 15000. Here sorting, without flushing, didn't
  lead to much regression.


I think generally the regressions were visible with a) noticeable shared
buffers, b) workload not fitting into shared buffers, c) significant
throughput, leading to high cache replacement ratios.


Another thing worth mentioning, while not surprising, is that the
benefits of this patch are massively smaller when WAL and data are
separated onto different disks.  For workloads fitting into
shared_buffers I saw no performance difference - not particularly
surprising. I guess if you constructed a case where the data, not WAL,
is the bottleneck, that'd be different.  Also worth mentioning that the
separate-disk setup was noticeably faster.

> >The performance was still much more regular than before, i.e. no
> >more multi-second periods without any transactions happening.
> >
> >By now I think I know what's going on: Before the sorting portion of the
> >patch the write-loop in BufferSync() starts at the current clock hand,
> >by using StrategySyncStart(). But after the sorting that obviously
> >doesn't happen anymore - buffers are accessed in their sort order. By
> >starting at the current clock hand and moving on from there the
> >checkpointer basically makes it less likely that victim buffers
> >need to be written either by the backends themselves or by
> >bgwriter. That means that the sorted checkpoint writes can, indirectly,
> >increase the number of unsorted writes by other processes :(
> 
> I'm quite surprised at such a large effect on throughput, though.

Me too.
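
To spell the mechanism out once more, the difference is roughly this
(illustrative fragment only, not actual bufmgr.c code; signatures are
approximated and WriteOneDirtyBuffer() is a made-up stand-in for the
real pin / BM_CHECKPOINT_NEEDED check / write logic):

#include <stddef.h>

extern int  NBuffers;                           /* PostgreSQL global */
extern int  StrategySyncStart(unsigned int *complete_passes,
                              unsigned int *num_buf_alloc);
extern void WriteOneDirtyBuffer(int buf_id);    /* hypothetical helper */

static void
old_style_write_loop(void)
{
    /* Start at the clock hand, so the buffers written first are roughly
     * the ones the clock sweep will hand out as victims next. */
    int     buf_id = StrategySyncStart(NULL, NULL);

    for (int i = 0; i < NBuffers; i++)
    {
        WriteOneDirtyBuffer(buf_id);
        if (++buf_id >= NBuffers)
            buf_id = 0;
    }
}

/* The sorted version instead walks an array ordered by (tablespace,
 * relation, block).  That's great for the disks, but a dirty buffer
 * sitting just ahead of the clock hand may now be written much later,
 * so a backend needing that victim writes it out itself - unsorted. */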


> This explanation seems to suggest that if bgwriter/worker writes were sorted
> and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential, the
data in shared buffers will be laid out in an almost entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.
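
For illustration, "sorting with a window" would amount to something like
the following sketch (names and window size made up): buffer up the next
N victim writes, sort them by file/block, then issue them in that order.

#include <stdint.h>
#include <stdlib.h>

#define WRITE_WINDOW 32                 /* made-up window size */

typedef struct PendingWrite
{
    uint32_t  relnode;                  /* which file */
    uint32_t  blocknum;                 /* which block within the file */
} PendingWrite;

static int
pending_write_cmp(const void *a, const void *b)
{
    const PendingWrite *pa = a;
    const PendingWrite *pb = b;

    if (pa->relnode != pb->relnode)
        return pa->relnode < pb->relnode ? -1 : 1;
    if (pa->blocknum != pb->blocknum)
        return pa->blocknum < pb->blocknum ? -1 : 1;
    return 0;
}

static void
flush_window(PendingWrite *window, int n,
             void (*write_one)(const PendingWrite *))
{
    qsort(window, n, sizeof(PendingWrite), pending_write_cmp);
    for (int i = 0; i < n; i++)
        write_one(&window[i]);          /* now in file/block order */
}

With random replacement the window entries will usually come from that
many different files, so the sort buys you next to nothing.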


> ISTM that this explanation could be checked by looking at whether
> bgwriter/worker writes are especially large compared to checkpointer writes
> in those cases with reduced throughput? The data is in the log.

What do you mean by "large"? Numerous?

> >My benchmarking suggest that that effect is the larger, the shorter the
> >checkpoint timeout is.
> 
> Hmmm. The shorter the timeout, the more likely the sorting is NOT to be
> effective

You mean, as evidenced by the results, or is that what you'd actually
expect?


> >2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
> >  later can contain multiple elements like checkpoint, bgwriter,
> >  backends, ddl, bulk-writes. That seems better than adding GUCs for
> >  these separately. Then make the flush locations in the patch
> >  configurable using that.
> 
> My 0,02€ on this point: I have not seen much of this style of guc elsewhere.
> The only ones I found while scanning the postgres config file are *_path and
> *_libraries. It seems to me that this would depart significantly from the
> usual style, so one guc per case, or one shared guc but with only on/off,
> would blend in more cleanly with the usual style.

Such a guc would still allow plain 'on' and 'off' settings, and hopefully
one of those would be the norm. That seems advantageous to me.
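
To illustrate, parsing such a list guc could look roughly like this
(guc name and accepted elements are made up for the example); 'off'
would simply be the empty list:

/* Sketch of parsing e.g.  flush_to_disk = 'checkpoint, bgwriter'  */
#include <stdbool.h>
#include <string.h>
#include <strings.h>

typedef struct FlushTargets
{
    bool checkpoint;
    bool bgwriter;
    bool backends;
} FlushTargets;

static bool
parse_flush_targets(const char *value, FlushTargets *targets)
{
    char    buf[256];
    char   *tok;
    char   *save;

    memset(targets, 0, sizeof(*targets));
    strncpy(buf, value, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok_r(buf, ", ", &save); tok != NULL;
         tok = strtok_r(NULL, ", ", &save))
    {
        if (strcasecmp(tok, "checkpoint") == 0)
            targets->checkpoint = true;
        else if (strcasecmp(tok, "bgwriter") == 0)
            targets->bgwriter = true;
        else if (strcasecmp(tok, "backends") == 0)
            targets->backends = true;
        else
            return false;               /* unknown element */
    }
    return true;
}

An explicit 'on'/'off' could additionally be accepted as shorthand for
all/none of the elements.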


> >3) I think we should remove the sort timing from the checkpoint logging
> >  before commit. It'll always be pretty short.
> 
> I added it to show that it was really short, in response to concerns that my
> approach of just sorting through indexes to reduce the memory needed instead
> of copying the data to be sorted did not induce significant performance
> issues. I proved my point, but peer pressure made me switch to larger
> memory anyway.

Grumble. I'm getting a bit tired of this topic. This wasn't even
remotely primarily about sorting speed, and you damn well know it.


> I think it should be kept while the features are under testing. I do not
> think that it harms in any way.

That's why I said we should remove it *before commit*.

Greetings,

Andres Freund

