On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a number of different cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >  -D /srv/temp/pgdev-dev-800/ \
>> >  -c maintenance_work_mem=2GB \
>> >  -c fsync=on \
>> >  -c synchronous_commit=off \
>> >  -c shared_buffers=2GB \
>> >  -c wal_level=hot_standby \
>> >  -c max_wal_senders=10 \
>> >  -c max_wal_size=100GB \
>> >  -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed: 2995 tps
> After I kill the wal writer with -STOP: 10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3) = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3) = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0 "..., 98304) = 98304
> 17:29:02.008822 fdatasync(3) = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3) = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0 "..., 73728) = 73728
> 17:29:02.019355 fdatasync(3) = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small amounts of pages, roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.
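A quick way to see why several hundred small fdatasync() calls per second
are so expensive is to write the same total volume of data with a flush
after every small chunk versus a flush after every large chunk. The sketch
below is only an illustration, not anything from this thread: the file
name, chunk sizes and iteration counts are arbitrary assumptions, and the
absolute numbers will vary enormously between SSDs, spinning rust and
battery-backed caches.

/* flushtest.c - compare many small write()+fdatasync() cycles against
 * fewer, larger ones moving the same total amount of data.
 * Build with something like: cc -O2 -o flushtest flushtest.c */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double run(const char *path, size_t chunk, int nchunks)
{
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); exit(1); }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nchunks; i++)
    {
        if (write(fd, buf, chunk) != (ssize_t) chunk) { perror("write"); exit(1); }
        if (fdatasync(fd) != 0) { perror("fdatasync"); exit(1); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* 32 MB total in both cases, but 4096 flushes vs. 64 flushes. */
    double small = run("flushtest.dat", 8192, 4096);
    double large = run("flushtest.dat", 512 * 1024, 64);

    printf("8 kB chunks, fdatasync each:   %.2f s\n", small);
    printf("512 kB chunks, fdatasync each: %.2f s\n", large);
    unlink("flushtest.dat");
    return 0;
}

On most storage the first variant is much slower even though it moves
exactly the same 32 MB, which is the same effect the strace excerpt shows:
the hardware ends up servicing one cache flush per small write.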
This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs <si...@2ndquadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000

    Wakeup WALWriter as needed for asynchronous commit performance.

    Previously we waited for wal_writer_delay before flushing WAL. Now
    we also wake WALWriter as soon as a WAL buffer page has filled.
    Significant effect observed on performance of asynchronous commits
    by Robert Haas, attributed to the ability to set hint bits on tuples
    earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, the WAL writer woke up
five times per second and flushed just that often (unless you changed
the default settings).  But as the commit message explains, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can start
the next.  That must have seemed like a win at the time, judging from
the way the commit message was written, but you now seem to be seeing
the opposite effect: performance suffers because flushes are too
frequent rather than too infrequent.  I wonder whether there is an ideal
flush rate, what it is, and how much it depends on the hardware involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
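To put rough numbers on the two behaviours contrasted above - a flush at
most once per wal_writer_delay versus a wakeup every time an 8 kB WAL page
fills - here is a back-of-the-envelope sketch. The WAL generation rate in
it is an assumed figure chosen only for illustration, not a measurement
from this thread, and the page-fill number is an upper bound because pages
that fill while a flush is in progress get coalesced into the next one.

/* flushrate.c - rough arithmetic only; the WAL rate below is an assumption. */
#include <stdio.h>

int main(void)
{
    const double wal_bytes_per_sec  = 25e6;   /* assumed WAL generation rate, ~25 MB/s */
    const double wal_page_bytes     = 8192;   /* XLOG_BLCKSZ default */
    const double wal_writer_delay_s = 0.2;    /* wal_writer_delay default, 200 ms */

    /* Pre-4de82f7d7: at most one flush per wal_writer_delay. */
    double timer_flushes = 1.0 / wal_writer_delay_s;

    /* Post-4de82f7d7: one wakeup per filled WAL page; an upper bound on
     * actual flushes, since pages filled during a flush are coalesced. */
    double page_wakeups = wal_bytes_per_sec / wal_page_bytes;

    printf("timer-driven flushes: ~%.0f per second\n", timer_flushes);
    printf("page-fill wakeups:    ~%.0f per second (upper bound on flushes)\n", page_wakeups);
    return 0;
}

Even after heavy coalescing, the gap between those two figures is orders
of magnitude, which is consistent with the roughly 500 fdatasync() calls
per second seen in the strace above.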