On Jul 14, 2013 9:46 PM, "Greg Smith" <g...@2ndquadrant.com> wrote:
> I updated and re-reviewed that in 2011:
> http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com
> and commented on why I think the improvement was difficult to reproduce
> back then. The improvement didn't follow for me either. It would take a
> really amazing bit of data to get me to believe write sorting code is
> worthwhile after that. On large systems capable of dirtying enough blocks
> to cause a problem, the operating system and RAID controllers are already
> sorting blocks. And *that* sorting is also considering concurrent read
> requests, which are a lot more important to an efficient schedule than
> anything the checkpoint process knows about. The database doesn't have
> nearly enough information yet to compete against OS level sorting.
That reasoning makes no sense. OS level sorting can only see writes in the time window between PostgreSQL issuing the write and the write being forced to disk. Spread checkpoints sprinkle the writes out over a long period, and the general tuning advice is to tightly bound the amount of memory the OS is willing to keep dirty. This makes the probability of scheduling adjacent writes together quite low, with the merging window limited either by dirty_bytes or by dirty_expire_centisecs. The checkpointer has the best long term overview of the situation here; OS scheduling only has a short term view of the outstanding read and write requests. By sorting checkpoint writes, it is much more likely that adjacent blocks are visible to OS writeback at the same time and will be issued together.

I gave the linked patch a shot. I tried it with pgbench at scale 100 and concurrency 32, with shared_buffers=3GB, checkpoint_timeout=5min, checkpoint_segments=100, checkpoint_completion_target=0.5. pgdata was on a 7200 RPM HDD, xlog on an Intel 320 SSD. Kernel settings: dirty_background_bytes = 32M, dirty_bytes = 128M.

First checkpoint on master:

    wrote 209496 buffers (53.7%); 0 transaction log file(s) added, 0 removed,
    26 recycled; write=314.444 s, sync=9.614 s, total=324.166 s;
    sync files=16, longest=9.208 s, average=0.600 s

IO while checkpointing: about 500 write iops at 5 MB/s, 100% utilisation.

First checkpoint with checkpoint sorting applied:

    wrote 205269 buffers (52.6%); 0 transaction log file(s) added, 0 removed,
    0 recycled; write=149.049 s, sync=0.386 s, total=149.559 s;
    sync files=39, longest=0.255 s, average=0.009 s

IO while checkpointing: about 23 write iops at 12 MB/s, 10% utilisation.

Transaction processing rate for a 20 minute run went from 5200 to 7000 tps. It looks to me like, in this admittedly best-case workload, the sorting is working exactly as designed, converting mostly random IO into sequential IO. I have seen many real world workloads where this kind of sorting would have benefited greatly.

I also did an I/O bound test with scale factor 100 and checkpoint_timeout=30min. The 2 hour average tps went from 121 to 135, but I'm not yet sure whether that is repeatable or just noise.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
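P.S. To make the mechanism concrete, here is a minimal, self-contained sketch of the idea: collect the dirty buffer tags once per checkpoint, sort them by file and block number, and issue the writes in that order so adjacent blocks reach OS writeback close together. The struct and field names below are made up for illustration; they are not the ones used in the patch or in bufmgr.c.

#include <stdio.h>
#include <stdlib.h>

typedef struct DirtyBufferTag
{
	unsigned	tablespace;		/* tablespace the block lives in */
	unsigned	relation;		/* relation (data file) */
	unsigned	block;			/* block number within that file */
	int			buf_id;			/* shared buffer slot holding it */
} DirtyBufferTag;

/* Order writes by file, then by block number within the file. */
static int
cmp_dirty_tag(const void *a, const void *b)
{
	const DirtyBufferTag *ta = (const DirtyBufferTag *) a;
	const DirtyBufferTag *tb = (const DirtyBufferTag *) b;

	if (ta->tablespace != tb->tablespace)
		return (ta->tablespace < tb->tablespace) ? -1 : 1;
	if (ta->relation != tb->relation)
		return (ta->relation < tb->relation) ? -1 : 1;
	if (ta->block != tb->block)
		return (ta->block < tb->block) ? -1 : 1;
	return 0;
}

int
main(void)
{
	/* Dirty blocks in the order the checkpoint scan happened to find them. */
	DirtyBufferTag dirty[] = {
		{1663, 16384, 742, 0},
		{1663, 16384, 3, 1},
		{1663, 16390, 10, 2},
		{1663, 16384, 741, 3},
	};
	int			n = sizeof(dirty) / sizeof(dirty[0]);
	int			i;

	/*
	 * Sort once per checkpoint, then issue the writes in sorted order so
	 * that adjacent blocks hit OS writeback at the same time and can be
	 * merged into sequential IO.
	 */
	qsort(dirty, n, sizeof(DirtyBufferTag), cmp_dirty_tag);

	for (i = 0; i < n; i++)
		printf("write rel %u block %u (buffer %d)\n",
			   dirty[i].relation, dirty[i].block, dirty[i].buf_id);

	return 0;
}

The real patch of course works on the shared buffer descriptors and still spreads the sorted writes out according to checkpoint_completion_target; the ordering is the part that matters for writeback merging.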