On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >I measured it in a different number of cases, both on SSDs and spinning
> >rust. I just reproduced it with:
> >
> >postgres-ckpt14 \
> >       -D /srv/temp/pgdev-dev-800/ \
> >       -c maintenance_work_mem=2GB \
> >       -c fsync=on \
> >       -c synchronous_commit=off \
> >       -c shared_buffers=2GB \
> >       -c wal_level=hot_standby \
> >       -c max_wal_senders=10 \
> >       -c max_wal_size=100GB \
> >       -c checkpoint_timeout=30s
> >
> >Using a fresh cluster each time (copied from a "template" to save time)
> >and using
> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1

So, I've analyzed the problem further, and I think I found something
rather interesting. I profiled the kernel looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed:                             2995 tps
After I kill the wal writer with -STOP:         10887 tps
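
(To repeat this: find the wal writer's pid, trace and pause it. The pgrep
pattern is an assumption; check ps for the actual process title.)

  WALWRITER=$(pgrep -f 'wal writer')    # wal writer of this cluster
  strace -tt -p "$WALWRITER"            # per-syscall trace with microsecond timestamps
  kill -STOP "$WALWRITER"               # pause it, then re-run pgbench
  kill -CONT "$WALWRITER"               # resume it afterwards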

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3)            = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3)            = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
17:29:02.008822 fdatasync(3)            = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3)            = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
17:29:02.019355 fdatasync(3)            = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small numbers of pages, roughly 500 times a
second. As soon as the wal writer is stopped, the writes come in much
bigger chunks, on the order of 50-130 pages. And, not that surprisingly,
that improves performance, because there are far fewer cache flushes
submitted to the hardware.
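
The effect of flush frequency is easy to see even outside postgres, e.g.
with dd (just an illustration of flush-per-write vs. one flush at the end,
not the exact WAL write pattern; the output path is a placeholder):

  # 16MB in 8kB blocks, one cache flush per block (O_DSYNC)
  dd if=/dev/zero of=/srv/temp/flushtest bs=8k count=2048 oflag=dsync
  # same amount of data, a single fdatasync at the end
  dd if=/dev/zero of=/srv/temp/flushtest bs=8k count=2048 conv=fdatasync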


> I'm running some tests similar to those above...

> Do you do some warmup when testing? I guess the answer is "no".

It doesn't make a difference here; I tried both. As long as the
before/after benchmarks start from the same state...


> I understand that you have 8 cores/16 threads on your host?

On one of them, 4 cores/8 threads on the laptop.


> Loading scale-800 data for a 300-second test takes much more than 300
> seconds (init takes ~360 seconds, and vacuum & index creation are slow). With
> 30-second checkpoint cycles and without any warmup, I feel these tests are
> really on the very short (too short) side, so I'm not sure how much I can
> trust such results as significant. The data I reported were with more
> real-life-like parameters.

I see exactly the same with 300s or 1000s checkpoint cycles; it just
takes a lot longer to repeat. The results are also similar (although
obviously the numbers both before and after the patch are higher) if I
disable full_page_writes, thereby eliminating a lot of other IO.

Andres

