On Jul 14, 2013 9:46 PM, "Greg Smith" <g...@2ndquadrant.com> wrote:
> I updated and re-reviewed that in 2011: 
> http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com and 
> commented on why I think the improvement was difficult to reproduce back 
> then.  The improvement didn't follow for me either.  It would take a really 
> amazing bit of data to get me to believe write sorting code is worthwhile 
> after that.  On large systems capable of dirtying enough blocks to cause a 
> problem, the operating system and RAID controllers are already sorting block. 
>  And *that* sorting is also considering concurrent read requests, which are a 
> lot more important to an efficient schedule than anything the checkpoint 
> process knows about.  The database doesn't have nearly enough information yet 
> to compete against OS level sorting.

That reasoning makes no sense. OS level sorting can only see the
writes in the time window between PostgreSQL write, and being forced
to disk. Spread checkpoints sprinkles the writes out over a long
period and the general tuning advice is to heavily bound the amount of
memory the OS willing to keep dirty. This makes probability of
scheduling adjacent writes together quite low, the merging window
being limited either by dirty_bytes or dirty_expire_centisecs. The
checkpointer has the best long term overview of the situation here, OS
scheduling only has the short term view of outstanding read and write
requests. By sorting checkpoint writes it is much more likely that
adjacent blocks are visible to OS writeback at the same time and will
be issued together.

I gave the linked patch a shot. I tried it with pgbench scale 100
concurrency 32, postgresql shared_buffers=3GB,
checkpoint_timeout=5min, checkpoint_segments=100,
checkpoint_completion_target=0.5, pgdata was on a 7200RPM HDD, xlog on
Intel 320 SSD, kernel settings: dirty_background_bytes = 32M,
dirty_bytes = 128M.

first checkpoint on master: wrote 209496 buffers (53.7%); 0
transaction log file(s) added, 0 removed, 26 recycled; write=314.444
s, sync=9.614 s, total=324.166 s; sync files=16, longest=9.208 s,
average=0.600 s
IO while checkpointing: about 500 write iops at 5MB/s, 100% utilisation.

first checkpoint with checkpoint sorting applied: wrote 205269 buffers
(52.6%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=149.049 s, sync=0.386 s, total=149.559 s; sync files=39,
longest=0.255 s, average=0.009 s
IO while checkpointing: about 23 write iops at 12MB/s, 10% utilisation.

Transaction processing rate for a 20min run went from 5200 to 7000.

Looks to me that in this admittedly best case workload the sorting is
working exactly as designed, converting mostly random IO into
sequential. I have seen many real world workloads where this kind of
sorting would have benefited greatly.

I also did a I/O bound test with scalefactor 100 and
checkpoint_timeout 30min. 2hour average tps went from 121 to 135, but
I'm not yet sure if it's repeatable or just noise.

Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to