On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a number of different cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >  -D /srv/temp/pgdev-dev-800/ \
>> >  -c maintenance_work_mem=2GB \
>> >  -c fsync=on \
>> >  -c synchronous_commit=off \
>> >  -c shared_buffers=2GB \
>> >  -c wal_level=hot_standby \
>> >  -c max_wal_senders=10 \
>> >  -c max_wal_size=100GB \
>> >  -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed: 2995 tps
> After I kill the wal writer with -STOP: 10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3) = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3) = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0 "..., 98304) = 98304
> 17:29:02.008822 fdatasync(3) = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3) = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0 "..., 73728) = 73728
> 17:29:02.019355 fdatasync(3) = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small amounts of pages, roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.
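A quick way to see why several hundred small fdatasync() calls per second
are so expensive is to write the same total volume of data with a flush
after every small chunk versus a flush after every large chunk. The sketch
below is only an illustration, not anything from this thread: the file
name, chunk sizes and iteration counts are arbitrary assumptions, and the
absolute numbers will vary enormously between SSDs, spinning rust and
battery-backed caches.

/* flushtest.c - compare many small write()+fdatasync() cycles against
 * fewer, larger ones moving the same total amount of data.
 * Build with something like: cc -O2 -o flushtest flushtest.c */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double run(const char *path, size_t chunk, int nchunks)
{
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); exit(1); }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nchunks; i++)
    {
        if (write(fd, buf, chunk) != (ssize_t) chunk) { perror("write"); exit(1); }
        if (fdatasync(fd) != 0) { perror("fdatasync"); exit(1); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* 32 MB total in both cases, but 4096 flushes vs. 64 flushes. */
    double small = run("flushtest.dat", 8192, 4096);
    double large = run("flushtest.dat", 512 * 1024, 64);

    printf("8 kB chunks, fdatasync each:   %.2f s\n", small);
    printf("512 kB chunks, fdatasync each: %.2f s\n", large);
    unlink("flushtest.dat");
    return 0;
}

On most storage the first variant is much slower even though it moves
exactly the same 32 MB, which is the same effect the strace excerpt shows:
the hardware ends up servicing one cache flush per small write.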
This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs <si...@2ndquadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000

    Wakeup WALWriter as needed for asynchronous commit performance.

    Previously we waited for wal_writer_delay before flushing WAL. Now
    we also wake WALWriter as soon as a WAL buffer page has filled.
    Significant effect observed on performance of asynchronous commits
    by Robert Haas, attributed to the ability to set hint bits on tuples
    earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, the WAL writer woke up
five times per second and flushed just that often (unless you changed
the default settings).  But as the commit message explains, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can start
the next.  That must have seemed like a win at the time, judging from
the way the commit message was written, but you now seem to be seeing
the opposite effect: performance suffers because flushes are too
frequent rather than too infrequent.  I wonder whether there is an ideal
flush rate, what it is, and how much it depends on the hardware involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
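To put rough numbers on the two behaviours contrasted above - a flush at
most once per wal_writer_delay versus a wakeup every time an 8 kB WAL page
fills - here is a back-of-the-envelope sketch. The WAL generation rate in
it is an assumed figure chosen only for illustration, not a measurement
from this thread, and the page-fill number is an upper bound because pages
that fill while a flush is in progress get coalesced into the next one.

/* flushrate.c - rough arithmetic only; the WAL rate below is an assumption. */
#include <stdio.h>

int main(void)
{
    const double wal_bytes_per_sec  = 25e6;   /* assumed WAL generation rate, ~25 MB/s */
    const double wal_page_bytes     = 8192;   /* XLOG_BLCKSZ default */
    const double wal_writer_delay_s = 0.2;    /* wal_writer_delay default, 200 ms */

    /* Pre-4de82f7d7: at most one flush per wal_writer_delay. */
    double timer_flushes = 1.0 / wal_writer_delay_s;

    /* Post-4de82f7d7: one wakeup per filled WAL page; an upper bound on
     * actual flushes, since pages filled during a flush are coalesced. */
    double page_wakeups = wal_bytes_per_sec / wal_page_bytes;

    printf("timer-driven flushes: ~%.0f per second\n", timer_flushes);
    printf("page-fill wakeups:    ~%.0f per second (upper bound on flushes)\n", page_wakeups);
    return 0;
}

Even after heavy coalescing, the gap between those two figures is orders
of magnitude, which is consistent with the roughly 500 fdatasync() calls
per second seen in the strace above.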