Looks like it's time to close the book on this one for 9.1 development...the unfortunate results are at http://www.2ndquadrant.us/pgbench-results/index.htm (test set #12 is the one with spread sync; #9 is the reference I was trying to improve on). TPS is about 5% slower at scale=500 and 15% slower at scale=1000 with the syncs spread out. Even worse, maximum latency went up a lot.
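
Each of those data points boils down to roughly the following pgbench invocation (the runs themselves were scripted via pgbench-tools; the database name, run length, and thread count shown here are placeholders rather than the exact values used):

  pgbench -i -s 1000 pgbench                # build the scale=1000 test database
  pgbench -c 256 -j 8 -T 600 -l pgbench     # timed run with 256 clients, logging per-transaction latency
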
I am convinced of a few things now:

1) Most of the benefit we were seeing from the original patch I submitted came simply from doing a much better job of absorbing fsync requests from backends while the checkpoint sync was running. The already committed fsync compaction patch effectively removes that problem, to the extent it's possible to do so, making the remaining pieces here not as useful in its wake.

2) I need to start over testing here with something that isn't 100% writes all of the time the way stock pgbench is. It's really hard to isolate latency improvements when the test program guarantees all associated write caches will be completely filled at every moment. It also means I can't see any benefit from changes that mainly help readers, and an all-write workload is quite unrealistic relative to real-world ones anyway. (A rough sketch of the kind of mixed run I have in mind follows this list.)

3) The existing write spreading code in the background writer needs to be overhauled, too, before spreading the syncs around is going to give the benefits I was hoping for.
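
On point 2: pgbench can already get partway toward that, since it has a built-in select-only mode (-S) and accepts more than one -f script, picking one of them at random for each transaction. Something along these lines is the sort of mix I mean (script names and proportions are just placeholders):

  pgbench -S -c 256 -j 8 -T 600 pgbench                          # built-in read-only test
  pgbench -c 256 -j 8 -T 600 -f tpcb.sql -f select.sql pgbench   # roughly a 50/50 write/read mix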

Given all that, I'm going to take the feedback from these tests, set this patch aside, and give the test server a much-deserved break. I'm happy that the fsync compaction patch has made 9.1 much more tolerant of write-heavy loads than earlier versions, so it's not as if no progress was made in this release.

For anyone who wants more details here...the news on this spread sync implementation is not all bad. If you compare this result from HEAD, with scale=1000 and clients=256:

http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer transactions in the >2000 ms latency range. That shows up as a reduction in the 90th percentile latency figures I compute, and you can see it in the graphs if you look at how much denser the points are in the 2000 - 4000 ms area on #611. But that's a pretty weak improvement.
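
Those percentile figures come from pgbench-tools, but roughly the same number can be pulled straight out of the per-transaction logs pgbench writes with -l; assuming the stock log format, where the third field is the transaction time in microseconds, a quick and dirty 90th percentile looks like this:

  sort -k 3,3n pgbench_log.* | awk '{lat[NR]=$3} END {printf "90th percentile: %.1f ms\n", lat[int(NR*0.9)]/1000}'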

But the most disappointing part here, relative to what I was hoping for, is what happens with bigger buffer caches. The main idea driving this approach was that it would enable larger values of shared_buffers without the checkpoint spikes getting as bad. Test set #13 tries that out, by increasing shared_buffers from 256MB to 4GB along with a big enough increase in checkpoint_segments to make most checkpoints time based (a sketch of those settings follows the log excerpt below). Not only did TPS at the smaller scale drop in half, all kinds of bad things happened to latency. Here's a sample of the sort of dysfunctional checkpoint that came out of that:

2011-02-10 02:41:17 EST: LOG:  checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync:  estimated segments=22
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG: checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG: checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG: checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG: checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=19 file=base/16384/16761_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=20 file=base/16384/16749_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=21 file=base/16384/16385 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=22 file=base/16384/16761.1 time=175.575 msec
2011-02-10 02:54:10 EST: LOG: checkpoint complete: wrote 242614 buffers (46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled; write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22, longest=20.874 s, average=2.484 s
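
For reference, test set #13 ran with postgresql.conf changes along these lines; shared_buffers is the real value, while the checkpoint_segments figure is just a stand-in for "big enough that most checkpoints are time based", and the rest are the defaults implied by the logs:

  shared_buffers = 4GB           # up from 256MB in the earlier test sets
  checkpoint_segments = 64       # illustrative value only
  checkpoint_timeout = 5min      # default, matching the expected 5 minute checkpoint spacing
  log_checkpoints = on           # source of the "checkpoint complete" line above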

That checkpoint spent 12 minutes in its write phase, even though checkpoints should be happening every 5 minutes here. With the write phase overrunning that badly, spread sync had no room left to work in, so there was no net improvement at all. What's happening here is similar to the behavior I described seeing on a client system but didn't have an example of to share until now. During the write phase, the "Dirty:" figure in /proc/meminfo peaked at over 1GB while writes were happening, and eventually the background writer process wasn't getting any serious CPU time compared to the backends; this is what it looked like via ps:

%CPU     %MEM        TIME+     COMMAND
4    0    01:51.28     /home/gsmith/pgwork/inst/spread-sync/bin/pgbench -f /home/gsmith/pgbench-tools
2    8.1    00:39.71     postgres: gsmith pgbench ::1(43871) UPDATE
2    8    00:39.28     postgres: gsmith pgbench ::1(43875) UPDATE
2    8.1    00:39.92     postgres: gsmith pgbench ::1(43865) UPDATE
2    8.1    00:39.54     postgres: gsmith pgbench ::1(43868) UPDATE
2    8    00:39.36     postgres: gsmith pgbench ::1(43870) INSERT
2    8.1    00:39.47     postgres: gsmith pgbench ::1(43877) UPDATE
1    8    00:39.39     postgres: gsmith pgbench ::1(43864) COMMIT
1    8.1    00:39.78     postgres: gsmith pgbench ::1(43866) UPDATE
1    8    00:38.99     postgres: gsmith pgbench ::1(43867) UPDATE
1    8.1    00:39.55     postgres: gsmith pgbench ::1(43872) UPDATE
1    8.1    00:39.90     postgres: gsmith pgbench ::1(43873) UPDATE
1    8.1    00:39.64     postgres: gsmith pgbench ::1(43876) UPDATE
1    8.1    00:39.93     postgres: gsmith pgbench ::1(43878) UPDATE
1    8.1    00:39.83     postgres: gsmith pgbench ::1(43863) UPDATE
1    8    00:39.47     postgres: gsmith pgbench ::1(43869) UPDATE
1    8.1    00:40.11     postgres: gsmith pgbench ::1(43874) UPDATE
1    0    00:11.91     [flush-9:1]
0    0    27:43.75     [xfsdatad/6]
0    9.4    00:02.21     postgres: writer process
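
What I was watching amounts to something like this loop (the interval and the ps columns are arbitrary; it just tracks the kernel's dirty-page backlog and how much CPU the background writer is getting):

  while sleep 5; do
      grep '^Dirty:' /proc/meminfo
      ps -C postgres -o pcpu,pmem,cputime,args | grep 'writer process'
  done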

I want to make this problem go away, but as you can see, spreading the sync calls around isn't enough. I think the main write loop needs to get spread out more, too, so that the background writer is working at a more reasonable pace. I am pleased I've been able to reproduce this painful behavior at home using test data, because that much improves my odds of isolating its cause and testing solutions. But it's a tricky problem, and I'm certainly not going to fix it in the next week.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

