Looks like it's time to close the book on this one for 9.1 development...the unfortunate results are at http://www.2ndquadrant.us/pgbench-results/index.htm (test set #12 is the one with spread sync; #9 is the reference I was trying to improve on). TPS is about 5% slower at scale=500 and 15% slower at scale=1000 with the syncs spread out. Even worse, maximum latency went up a lot.
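
Each of those data points boils down to roughly the following pgbench invocation (the runs themselves were scripted via pgbench-tools; the database name, run length, and thread count shown here are placeholders rather than the exact values used):

  pgbench -i -s 1000 pgbench                # build the scale=1000 test database
  pgbench -c 256 -j 8 -T 600 -l pgbench     # timed run with 256 clients, logging per-transaction latency
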
I am convinced of a few things now:

1) Most of the benefit we were seeing from the original patch I submitted came simply from doing a much better job of absorbing fsync requests from backends while the checkpoint sync was running. The already committed fsync compaction patch effectively removes that problem, to the extent it's possible to do so, making the remaining pieces here not as useful in its wake.

2) I need to start over testing here with something that isn't 100% writes all of the time the way stock pgbench is. It's really hard to isolate latency improvements when the test program guarantees all associated write caches will be completely filled at every moment. It also means I can't see any benefit from changes that mainly help readers, and an all-write workload is quite unrealistic relative to real-world ones anyway. (A rough sketch of the kind of mixed run I have in mind follows this list.)

3) The existing write spreading code in the background writer needs to be overhauled, too, before spreading the syncs around is going to give the benefits I was hoping for.
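
On point 2: pgbench can already get partway toward that, since it has a built-in select-only mode (-S) and accepts more than one -f script, picking one of them at random for each transaction. Something along these lines is the sort of mix I mean (script names and proportions are just placeholders):

  pgbench -S -c 256 -j 8 -T 600 pgbench                          # built-in read-only test
  pgbench -c 256 -j 8 -T 600 -f tpcb.sql -f select.sql pgbench   # roughly a 50/50 write/read mix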

Given all that, I'm going to take the feedback from these tests, set this patch aside, and give the test server a much-deserved break. I'm happy that the fsync compaction patch has made 9.1 much more tolerant of write-heavy loads than earlier versions, so it's not as if no progress was made in this release.

For anyone who wants more details here...the news on this spread sync implementation is not all bad. If you compare this result from HEAD, with scale=1000 and clients=256:

http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer transactions in the >2000 ms latency range. That shows up as a reduction in the 90th percentile latency figures I compute, and you can see it in the graphs if you look at how much denser the points are in the 2000 - 4000 ms area on #611. But that's a pretty weak improvement.
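
Those percentile figures come from pgbench-tools, but roughly the same number can be pulled straight out of the per-transaction logs pgbench writes with -l; assuming the stock log format, where the third field is the transaction time in microseconds, a quick and dirty 90th percentile looks like this:

  sort -k 3,3n pgbench_log.* | awk '{lat[NR]=$3} END {printf "90th percentile: %.1f ms\n", lat[int(NR*0.9)]/1000}'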

But the most disappointing part here, relative to what I was hoping for, is what happens with bigger buffer caches. The main idea driving this approach was that it would enable larger values of shared_buffers without the checkpoint spikes getting as bad. Test set #13 tries that out, by increasing shared_buffers from 256MB to 4GB along with a big enough increase in checkpoint_segments to make most checkpoints time based (a sketch of those settings follows the log excerpt below). Not only did TPS at the smaller scale drop in half, all kinds of bad things happened to latency. Here's a sample of the sort of dysfunctional checkpoint that came out of that:

2011-02-10 02:41:17 EST: LOG:  checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync:  estimated segments=22
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG: checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG: checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG: checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG: checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=19 file=base/16384/16761_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=20 file=base/16384/16749_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=21 file=base/16384/16385 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=22 file=base/16384/16761.1 time=175.575 msec
2011-02-10 02:54:10 EST: LOG: checkpoint complete: wrote 242614 buffers (46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled; write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22, longest=20.874 s, average=2.484 s
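
For reference, test set #13 ran with postgresql.conf changes along these lines; shared_buffers is the real value, while the checkpoint_segments figure is just a stand-in for "big enough that most checkpoints are time based", and the rest are the defaults implied by the logs:

  shared_buffers = 4GB           # up from 256MB in the earlier test sets
  checkpoint_segments = 64       # illustrative value only
  checkpoint_timeout = 5min      # default, matching the expected 5 minute checkpoint spacing
  log_checkpoints = on           # source of the "checkpoint complete" line above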

That checkpoint spent 12 minutes in its write phase, even though checkpoints should be happening every 5 minutes here. With the write phase overrunning that badly, spread sync had no room left to work in, so there was no net improvement at all. What's happening here is similar to the behavior I described seeing on a client system but didn't have an example of to share until now. During the write phase, the "Dirty:" figure in /proc/meminfo peaked at over 1GB while writes were happening, and eventually the background writer process wasn't getting any serious CPU time compared to the backends; this is what it looked like via ps:

%CPU     %MEM        TIME+     COMMAND
4    0    01:51.28     /home/gsmith/pgwork/inst/spread-sync/bin/pgbench -f /home/gsmith/pgbench-tools
2    8.1    00:39.71     postgres: gsmith pgbench ::1(43871) UPDATE
2    8    00:39.28     postgres: gsmith pgbench ::1(43875) UPDATE
2    8.1    00:39.92     postgres: gsmith pgbench ::1(43865) UPDATE
2    8.1    00:39.54     postgres: gsmith pgbench ::1(43868) UPDATE
2    8    00:39.36     postgres: gsmith pgbench ::1(43870) INSERT
2    8.1    00:39.47     postgres: gsmith pgbench ::1(43877) UPDATE
1    8    00:39.39     postgres: gsmith pgbench ::1(43864) COMMIT
1    8.1    00:39.78     postgres: gsmith pgbench ::1(43866) UPDATE
1    8    00:38.99     postgres: gsmith pgbench ::1(43867) UPDATE
1    8.1    00:39.55     postgres: gsmith pgbench ::1(43872) UPDATE
1    8.1    00:39.90     postgres: gsmith pgbench ::1(43873) UPDATE
1    8.1    00:39.64     postgres: gsmith pgbench ::1(43876) UPDATE
1    8.1    00:39.93     postgres: gsmith pgbench ::1(43878) UPDATE
1    8.1    00:39.83     postgres: gsmith pgbench ::1(43863) UPDATE
1    8    00:39.47     postgres: gsmith pgbench ::1(43869) UPDATE
1    8.1    00:40.11     postgres: gsmith pgbench ::1(43874) UPDATE
1    0    00:11.91     [flush-9:1]
0    0    27:43.75     [xfsdatad/6]
0    9.4    00:02.21     postgres: writer process
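
What I was watching amounts to something like this loop (the interval and the ps columns are arbitrary; it just tracks the kernel's dirty-page backlog and how much CPU the background writer is getting):

  while sleep 5; do
      grep '^Dirty:' /proc/meminfo
      ps -C postgres -o pcpu,pmem,cputime,args | grep 'writer process'
  done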

I want to make this problem go away, but as you can see, spreading the sync calls around isn't enough. I think the main write loop needs to get spread out more, too, so that the background writer is working at a more reasonable pace. I am pleased I've been able to reproduce this painful behavior at home using test data, because that much improves my odds of isolating its cause and testing solutions. But it's a tricky problem, and I'm certainly not going to fix it in the next week.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

