On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> Hello Andres,
>
> > Hm. New theory: The current flush interface does the flushing inside
> > FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule().
> > The problem with that is that at that point we (need to) hold a
> > content lock on the buffer!
>
> You are worrying that FlushBuffer is holding a lock on a buffer and
> that the "sync_file_range" call is issued at that moment.
>
> Although I agree that it is not that good, I would be surprised if
> that were the explanation for a performance regression, because
> sync_file_range with the chosen parameters is an async call: it
> "advises" the OS to send the file to disk, but it does not wait for
> that to be completed.
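For reference, what the scheduled flush ultimately boils down to is the
following (a sketch only, simplified from the actual flush-context code;
the function name is made up):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Ask the kernel to *initiate* writeback of the given file range.
 * With only SYNC_FILE_RANGE_WRITE set, the call does not wait for the
 * data to actually reach disk - in that sense it is indeed async.
 */
static int
flush_hint(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}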
I frequently see sync_file_range blocking - it waits until it can
submit the writes into the IO queues. On a system bottlenecked on IO
that's not always possible immediately (see the little test program
further down).

> Also, maybe you could answer a question I had about the performance
> regression you observed. I could not find the post where you gave the
> detailed information about it, so that I could try reproducing it:
> what are the exact settings and conditions (shared_buffers, pgbench
> scaling, host memory, ...), what is the observed regression (tps?
> other?), and what is the responsiveness of the database under the
> regression (e.g. % of seconds with 0 tps, or something like that)?

I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
  -D /srv/temp/pgdev-dev-800/ \
  -c maintenance_work_mem=2GB \
  -c fsync=on \
  -c synchronous_commit=off \
  -c shared_buffers=2GB \
  -c wal_level=hot_standby \
  -c max_wal_senders=10 \
  -c max_wal_size=100GB \
  -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and running

pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I get:

My laptop, 1 EVO 840, 1 i7-4800MQ, 16GB RAM:

master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)

My laptop, 1 850 PRO, 1 i7-4800MQ, 16GB RAM:

master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)

In neither case are there periods of 0 tps, but both have stretches of
< 1000 tps with noticeably increased latency. The end results are
similar with a sane checkpoint timeout - the tests just take much
longer to give meaningful results. Constantly running long tests on
prosumer-level SSDs isn't nice - I've now killed 5 SSDs with postgres
testing...

As you can see, there's roughly a 30% performance regression on the
slower SSD and a ~9% one on the faster one. HDD results are similar
(but I can't repeat them on the laptop right now, since the 2nd HDD is
now an SSD).
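To convince yourself that the hint itself can stall, a crude,
self-contained test (Linux only; the file name and the 512MB figure are
arbitrary) is to dirty a good chunk of a file and time the call:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    static char page[8192];
    const size_t total = 512UL * 1024 * 1024;   /* 512MB of dirty data */
    struct timespec start, end;
    int fd;

    memset(page, 'x', sizeof(page));

    fd = open("dirty.dat", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* dirty the file in page-sized chunks */
    for (size_t off = 0; off < total; off += sizeof(page))
    {
        if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        {
            perror("write");
            return 1;
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* nbytes = 0 means "from offset through to end of file" */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) < 0)
    {
        perror("sync_file_range");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("sync_file_range took %.3f ms\n",
           (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6);

    close(fd);
    return 0;
}

On an idle disk that prints something tiny; run it while the device is
saturated and the "asynchronous" call can take arbitrarily long.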
My working copy of checkpoint sorting & flushing currently results in:

My laptop, 1 EVO 840, 1 i7-4800MQ, 16GB RAM:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)

My laptop, 1 850 PRO, 1 i7-4800MQ, 16GB RAM:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)

My version of the patch currently addresses various points, which need
to be separated and benchmarked separately:

* A different approach to the background writer, trying to make
  backends write less. While that proves to be beneficial in isolation,
  on its own it doesn't address the performance regression.
* A different flushing API, done outside the lock (see the PS for the
  basic idea).

So this partially addresses the performance problems, but not yet
completely.

Greetings,

Andres Freund
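PS: For clarity, the "flushing outside the lock" part boils down to the
pattern below. This is a sketch, not the actual patch: the names are
made up, a pthread rwlock stands in for the buffer content LWLock, and
the real code batches multiple ranges in a flush context rather than
hinting one range at a time.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* A flush request remembered while the content lock is still held. */
typedef struct PendingFlush
{
    int    fd;
    off_t  offset;
    off_t  nbytes;
} PendingFlush;

static void
flush_buffer(pthread_rwlock_t *content_lock, int fd,
             off_t offset, off_t nbytes, const char *page)
{
    PendingFlush pending;

    pthread_rwlock_rdlock(content_lock);    /* share-lock the buffer */

    /* write out the page while holding the lock, as before ... */
    if (pwrite(fd, page, (size_t) nbytes, offset) != (ssize_t) nbytes)
        abort();                            /* error handling elided */

    /* ... but only *remember* that the range needs a writeback hint */
    pending.fd = fd;
    pending.offset = offset;
    pending.nbytes = nbytes;

    pthread_rwlock_unlock(content_lock);

    /*
     * Issue the hint only after the lock is released: if submitting
     * the writes has to wait for the IO queues, that wait no longer
     * happens with the buffer locked.
     */
    (void) sync_file_range(pending.fd, pending.offset, pending.nbytes,
                           SYNC_FILE_RANGE_WRITE);
}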