Hello Andres,

I think you're misunderstanding how spread checkpoints work.

Yep, definitely :-) On the other hand, I thought I was seeking something "simple", namely correct latency under a small load, which I would expect out of the box.

What you describe is reasonable, and is more or less what I was hoping for, although I thought that bgwriter was involved from the start and checkpoint would only do what is needed in the end. My mistake.

When the checkpointer process starts a spread checkpoint it first writes all buffers to the kernel in a paced manner. That pace is determined by checkpoint_completion_target and checkpoint_timeout.

This pacing does not seem to work, even at slow pace.

If you have a stall of roughly the same magnitude (say a factor
of two different), the smaller once a minute, the larger once an
hour. Obviously the once-an-hour one will have a better latency in many,
many more transactions.

I do not believe that delaying writes to disk as much as possible is a viable strategy for handling a small load. However, to show my good will, I have tried to follow your advice: I launched a 5000-second test with 50 segments, a 30 min timeout and a 0.9 completion target, at 25 tps, which is less than 1/10 of the maximum throughput.
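For anyone reproducing this run, the settings above would look roughly as follows. This is only a sketch: CONF is a hypothetical file path, not the one I used, and log_checkpoints is an assumption on my part, added because it is what produces the "checkpoint starting/complete" log lines.

```shell
# Sketch: append the checkpoint settings for this run to a config file.
# CONF is a hypothetical path (assumption); point it at your own postgresql.conf.
CONF="${CONF:-./postgresql.conf.test}"
cat >> "$CONF" <<'EOF'
checkpoint_segments = 50             # WAL segments between automatic checkpoints
checkpoint_timeout = 30min           # interval between time-triggered checkpoints
checkpoint_completion_target = 0.9   # spread the writes over 90% of that interval
log_checkpoints = on                 # assumption: needed to get the checkpoint LOG lines
EOF
```

The server must reload its configuration (or be restarted) for these values to take effect.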

There are only two time-triggered checkpoints:

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
      1 transaction log file(s) added, 0 removed, 0 recycled;
      write=1619.750 s, sync=27.675 s, total=1647.932 s;
      sync files=14, longest=27.593 s, average=1.976 s

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
      0 transaction log file(s) added, 0 removed, 23 recycled;
      write=826.919 s, sync=9.989 s, total=837.023 s;
      sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380 MB. 1800 * 0.9 = 1620 seconds to complete, which means 30 buffer writes per second... should be ok. However, the sync costs 27 seconds nevertheless, and the server was more or less offline for about 30 seconds flat. For the second one, 180 MB to write, 10 seconds offline. For some reason the target time is reduced. I have also tried the "deadline" IO scheduler, which makes more sense than the default "cfq", but the result was similar. Not sure how software RAID interacts with IO scheduling, though.
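The arithmetic above can be checked with a small helper. This is a sketch that assumes the default 8 kB block size; ckpt_stats is a made-up name for illustration.

```shell
# Sketch: convert a checkpoint's buffer count into MB and an average write rate.
# Assumes the default 8 kB block size; ckpt_stats is a hypothetical helper name.
ckpt_stats() {
  # $1 = buffers written, $2 = duration of the write phase in seconds
  awk -v b="$1" -v s="$2" 'BEGIN {
    printf "%.0f MB, %.1f buffers/s\n", b * 8 / 1024, b / s
  }'
}

ckpt_stats 48725 1619.750   # first checkpoint:  381 MB, 30.1 buffers/s
ckpt_stats 22533 826.919    # second checkpoint: 176 MB, 27.2 buffers/s
```

The second checkpoint comes out at about 176 MB, which is the figure rounded to 180 MB above. In both cases the average write rate is around 30 buffers per second, so the write phase itself is slow and steady; the stall comes with the sync.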

Overall result: over the 5000s test, I have lost (i.e. more than 200ms behind schedule) more than 2.5% of transactions (1/40). Due to the unfinished cycle, the long term average is probably about 3%. Although it is better than 10%, it is not good. I would expect/hope for something pretty close to 0, even with ext4 on Linux, for a dedicated host which has nothing else to do but handle two dozen transactions per second.

Current conclusion: I have not found any way to improve the situation to "good" with parameters from the configuration. Currently a small load results in periodic offline time, that can be delayed but not avoided. The delaying tactic results in less frequent but longer downtime. I prefer frequent very short downtime instead.

I really think that something is amiss. Maybe pg does not handle pacing as it should.

For the record, a 25 tps bench with a "small" config (default 3 segments, 5 min timeout, 0.5 completion target), run in parallel with this loop:

        while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql

results in "losing" only 0.01% of transactions (12 transactions out of 125893 were more than 200ms behind in 5000 seconds). Although you may think it stupid, from my point of view it shows that it is possible to coerce pg to behave.

With respect to the current status:

(1) the ability to set checkpoint_timeout to values smaller than 30s could help, although obviously there would be other consequences. But the ability to avoid periodic offline time looks like a desirable objective.

(2) I still think that a parameter to force bgwriter to write more stuff could help, but this is not tested.

(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few minutes:

  sh> createdb test
  sh> pgbench -i -s 100 -F 95 test
  sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L option to limit latency is a submitted patch. Without it, unresponsiveness shows up as an increasing lag time.
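To compare checkpoints across runs, the write/sync/total timings can be pulled out of the server log with something like the sketch below. ckpt_summary is a made-up helper name, and the log file location is an assumption that depends on your logging setup.

```shell
# Sketch: print "write sync total" (in seconds) for each completed checkpoint.
# Reads server log text on stdin; ckpt_summary is a hypothetical helper name.
ckpt_summary() {
  sed -n 's/.*write=\([0-9.]*\) s, sync=\([0-9.]*\) s, total=\([0-9.]*\) s.*/\1 \2 \3/p'
}

# Example on one of the lines shown earlier:
echo "write=1619.750 s, sync=27.675 s, total=1647.932 s;" | ckpt_summary
```

In practice one would feed it the server log, e.g. `ckpt_summary < /path/to/server.log`, where the path is whatever your installation uses.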


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)