> I think you're misunderstanding how spread checkpoints work.
Yep, definitely :-) On the other hand, I thought I was asking for something
"simple", namely correct latency under a small load, which I would expect
out of the box.
What you describe is reasonable, and is more or less what I was hoping
for, although I thought that the bgwriter was involved from the start and
that the checkpoint would only do what was still needed at the end. My
mistake.
> When the checkpointer process starts a spread checkpoint it first writes
> all buffers to the kernel in a paced manner. That pace is determined by
> checkpoint_completion_target and [...]
This pacing does not seem to work, even at a slow pace.
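A minimal sketch of how I understand the pacing is supposed to work — the
names and structure here are illustrative only, not PostgreSQL's actual code:

```python
# Illustrative sketch of spread-checkpoint pacing, assuming the write pace
# is derived from checkpoint_timeout * checkpoint_completion_target.
# Function and variable names are hypothetical.

def buffer_write_interval(n_dirty, checkpoint_timeout, completion_target):
    """Seconds to sleep between buffer writes so that n_dirty dirty
    buffers are all handed to the kernel by timeout * target."""
    budget = checkpoint_timeout * completion_target  # seconds available
    return budget / n_dirty if n_dirty else 0.0

# With the first checkpoint logged below (48725 buffers, 30 min timeout,
# 0.9 completion target):
interval = buffer_write_interval(48725, 1800, 0.9)
pace = 1.0 / interval  # ~30 buffer writes per second
```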
> If you have stalls of roughly the same magnitude (say a factor of two
> different), the smaller once a minute, the larger once an hour, obviously
> the once-an-hour one will give a better latency in many, many more
> transactions.
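To make that tradeoff concrete, here is the arithmetic, with stall sizes of
my own choosing (15 s vs 30 s, a factor of two apart):

```python
# Hypothetical numbers for the quoted argument: two stalls of roughly the
# same magnitude (a factor of two apart), one per minute vs one per hour.

small_stall, small_period = 15, 60      # 15 s stall every minute (assumed)
large_stall, large_period = 30, 3600    # 30 s stall every hour (assumed)

# Fraction of wall-clock time (hence of transactions, at a steady rate)
# that lands inside a stall window.
frac_small = small_stall / small_period   # 25% of transactions delayed
frac_large = large_stall / large_period   # ~0.8% of transactions delayed
```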
I do not believe that delaying writes to disk as long as possible is a
viable strategy for handling a small load. However, to show my good will,
I have tried to follow your advice: I launched a 5000-second test with 50
segments, a 30 min timeout, and a 0.9 completion target, at 25 tps, which
is less than 1/10 of the maximum throughput.
There are only two time-triggered checkpoints:
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 48725 buffers (47.6%);
1 transaction log file(s) added, 0 removed, 0 recycled;
write=1619.750 s, sync=27.675 s, total=1647.932 s;
sync files=14, longest=27.593 s, average=1.976 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 22533 buffers (22.0%);
0 transaction log file(s) added, 0 removed, 23 recycled;
write=826.919 s, sync=9.989 s, total=837.023 s;
sync files=8, longest=6.742 s, average=1.248 s
For the first one, 48725 buffers is 380 MB. 1800 * 0.9 = 1620 seconds to
complete, so that means about 30 buffer writes per second... should be ok.
Nevertheless the sync costs 27 seconds, and the server was more or less
offline for about 30 seconds flat. For the second one, 180 MB to write, 10
seconds offline; for some reason the target time is reduced. I have also
tried the "deadline" IO scheduler, which makes more sense than the default
"cfq", but the result was similar. Not sure how software RAID interacts
with IO scheduling, though.
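Checking the arithmetic against the two log lines above (assuming the
default 8 kB block size):

```python
BLCKSZ = 8192  # default PostgreSQL block size, in bytes

# First checkpoint: 48725 buffers written over 1619.75 s
first_mb = 48725 * BLCKSZ / 2**20    # ~380 MiB, as stated
first_pace = 48725 / 1619.75         # ~30 buffer writes per second

# Second checkpoint: 22533 buffers written over 826.92 s
second_mb = 22533 * BLCKSZ / 2**20   # ~176 MiB, roughly the 180 MB above
second_pace = 22533 / 826.92         # ~27 buffer writes per second
```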
Overall result: over the 5000 s test, I lost (i.e. more than 200 ms
behind schedule) more than 2.5% of transactions (1 in 40). Due to the
unfinished cycle, the long-term average is probably about 3%. Although it
is better than 10%, it is not good. I would expect/hope for something
pretty close to 0, even with ext4 on Linux, on a dedicated host which has
nothing else to do but handle two dozen transactions per second.
Current conclusion: I have not found any way to improve the situation to
"good" with configuration parameters alone. Currently a small load results
in periodic offline time, which can be delayed but not avoided. The
delaying tactic results in less frequent but longer downtime; I would
prefer very short, frequent downtime instead.
I really think that something is amiss; maybe pg does not handle the
pacing as well as it should.
For the record, a 25 tps bench with a "small" config (default 3 segments,
5 min timeout, 0.5 completion target) and, running in parallel:
while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql
results in "losing" only 0.01% of transactions (12 transactions out of
125893 were behind by more than 200 ms over the 5000 seconds). Although
you may think it stupid, from my point of view it shows that it is
possible to coerce pg into behaving.
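The improvement from the forced-checkpoint workaround, in numbers:

```python
# Figures from the two runs above.
total = 125893           # transactions in the 5000 s workaround run
late = 12                # transactions more than 200 ms behind schedule
workaround_loss = late / total          # ~0.01%, as stated

spread_loss = 0.025      # ~2.5% lost in the spread-checkpoint run
improvement = spread_loss / workaround_loss   # a few hundred times better
```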
With respect to the current status:
(1) the ability to set checkpoint_timeout to values smaller than 30 s
could help, although obviously there would be other consequences. But the
ability to avoid periodic offline time looks like a desirable objective.
(2) I still think that a parameter to force bgwriter to write more stuff
could help, but this is not tested.
(3) Any other effective idea to configure for responsiveness is welcome!
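On (2), an untested sketch: the existing bgwriter knobs can at least be
pushed toward more aggressive writing (stock defaults in the comments).
Whether this actually helps with checkpoint-bound dirty buffers is exactly
what would need testing:

```
# postgresql.conf fragment -- untested, aggressive-bgwriter sketch
bgwriter_delay = 10ms             # default 200ms: wake up more often
bgwriter_lru_maxpages = 1000      # default 100: allow more writes per round
bgwriter_lru_multiplier = 10.0    # default 2.0: write further ahead of demand
```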
If someone wants to repeat these tests, it is easy and only takes a few
commands:
sh> createdb test
sh> pgbench -i -s 100 -F 95 test
sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out
Note: the -L option to limit latency is a submitted patch. Without it,
unresponsiveness shows up as steadily increasing lag.
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)