I'm getting increasingly unhappy about the checkpoint flush control.
I saw major regressions on my parallel COPY test, too:

Yes, I'm concerned too.

A few thoughts:

 - focusing on raw tps is not a good idea, because it may be a lot of tps
   followed by a sync panic, with an unresponsive database. I wish the
   performance reports would include some indication of the distribution
   (e.g. min/q1/median/q3/max of the per-second tps, standard deviation),
   not just the final "tps" figure.

 - checkpoint flush control (checkpoint_flush_after) should nearly always
   be beneficial because it flushes sorted data. I would be surprised
   to see significant regressions with this on. A lot of tests showed
   maybe improved tps, but mostly greatly improved performance stability,
   where a database that was unresponsive 60% of the time (60% of the
   per-second samples show very low or zero tps) becomes always
   responsive.

 - other flush controls ({backend,bgwriter}_flush_after) may just increase
   random writes, so they are riskier in nature because the data is not
   sorted, and they may or may not be a good idea depending on detailed
   conditions. A "parallel copy" would be just such a special IO load
   which degrades performance under these settings.

   Maybe these two should be disabled by default because they lead to
   possibly surprising regressions?

 - for any particular load, the admin can decide to disable these if
   they think it is better not to flush. Also, as suggested by Andres,
   with 128 parallel queries the default value may not be appropriate
   at all.
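
To make the first point concrete, here is a rough sketch of the kind of
summary I have in mind. It assumes you already have one tps value per
second of the run (for instance from pgbench progress reports); the
function name and the stall threshold are my own invention for
illustration, not anything an existing tool provides.

```python
import statistics


def tps_distribution(per_second_tps, stall_threshold=1.0):
    """Summarize per-second tps samples rather than one overall average.

    stall_threshold is a hypothetical cutoff: seconds with tps below it
    are counted as "unresponsive".
    """
    if len(per_second_tps) < 2:
        raise ValueError("need at least two per-second samples")
    samples = sorted(per_second_tps)
    # Quartiles with linear interpolation over the observed samples.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    stalled = sum(1 for s in samples if s < stall_threshold)
    return {
        "min": samples[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": samples[-1],
        "stddev": statistics.pstdev(samples),
        "stalled_fraction": stalled / len(samples),
    }


if __name__ == "__main__":
    # Made-up samples: three stalled seconds, then two busy ones.
    for name, value in tps_distribution([0, 0, 0, 100, 200]).items():
        print(f"{name}: {value}")
```

A run with a decent average but a high stalled_fraction is exactly the
"lots of tps followed by a sync panic" case that the final tps figure
hides.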

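For the last point, a sketch of what disabling the two riskier controls
could look like in postgresql.conf, on the understanding that a value of
0 turns forced writeback off for that setting. Leaving
checkpoint_flush_after alone here is my own choice for the example, not
a tested recommendation.

```
# Disable forced writeback for backends and the bgwriter only;
# checkpoint_flush_after is left at its default.
backend_flush_after = 0
bgwriter_flush_after = 0
```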
