On 1/5/12 1:19 AM, David Fetter wrote:
> To achieve efficiency, the checkpoint writer and bgwriter should batch
> writes to multiple pages together.  Currently, there is an option
> "batched_buffer_writes" that specifies how many buffers to batch at a
> time.  However, we may want to remove that option from view, and just
> force batched_buffer_writes to a default (32) if double_writes is
> enabled.

The idea that PostgreSQL has better information than the layers below it about how to batch writes is controversial, and in many of my test cases it failed to match expectations altogether. The nastiest regressions I ran into were in VACUUM, where the ring buffer implementation leaves the database extremely little room to work with. Just dumping that whole write mess into a large OS cache as quickly as possible, and letting the OS sort it out, was dramatically faster in some of my test cases. If you don't have one already, I'd recommend adding to your test suite a performance test that dirties a lot of pages and then runs VACUUM against them. Since you're not crippling the OS cache to the extent I was, the problem may not be as bad, but it's worth checking.
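
To be concrete about the test I mean, here's roughly the shape of it as a
small libpq program; the table layout, row count, and default connection
parameters are all placeholder choices, not anything from your patch:

    /*
     * Sketch of a VACUUM stress test: dirty a lot of pages, then time the
     * VACUUM that has to clean them up.  Build with: cc test.c -lpq
     */
    #include <stdio.h>
    #include <time.h>
    #include <libpq-fe.h>

    static void run(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);

        if (PQresultStatus(res) != PGRES_COMMAND_OK &&
            PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
    }

    int main(void)
    {
        PGconn *conn = PQconnectdb("");     /* use PG* environment defaults */
        time_t  start;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Dirty a lot of pages: build a table, then rewrite every row. */
        run(conn, "DROP TABLE IF EXISTS dirty_test");
        run(conn, "CREATE TABLE dirty_test AS "
                  "SELECT i, repeat('x', 500) AS filler "
                  "FROM generate_series(1, 1000000) AS i");
        run(conn, "UPDATE dirty_test SET filler = repeat('y', 500)");

        /* Now time the cleanup pass, which runs through the ring buffer. */
        start = time(NULL);
        run(conn, "VACUUM dirty_test");
        printf("VACUUM took %ld seconds\n", (long) (time(NULL) - start));

        PQfinish(conn);
        return 0;
    }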

I scribbled some notes on this problem area at http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html ; the links there that broke when our web site was rearranged now point to http://highperfpostgres.com/pgbench-results/index.htm (test summary) and http://www.highperfpostgres.com/pgbench-results/435/index.html (a really bad latency spike example).

> Given the batching functionality, double writes by the checkpoint
> writer (and bgwriter) are implemented efficiently by writing a batch of
> pages to the double-write file and fsyncing, then writing the pages to
> the appropriate data files and fsyncing all the necessary data files.
> While the data fsyncing might be viewed as expensive, it does help
> eliminate a lot of the fsync overhead at the end of checkpoints.
> FlushRelationBuffers() and FlushDatabaseBuffers() can be similarly
> batched.
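
If I'm reading that right, the per-batch sequence works out to something
like the following in plain POSIX terms. The names, the fixed 8K page
size, and the fsync deduplication here are my own illustration, not code
from the patch:

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192         /* illustrative fixed page size */

    typedef struct
    {
        int     data_fd;        /* file this page ultimately belongs in */
        off_t   offset;         /* byte offset of the page in that file */
        char    page[BLCKSZ];   /* dirty page image to write */
    } BatchEntry;

    /* Flush one batch (e.g. the proposed default of 32 entries).
     * Returns 0 on success, -1 on any write/fsync failure. */
    static int
    flush_batch(int dw_fd, BatchEntry *batch, int n)
    {
        int     i;

        /* 1. Write the whole batch sequentially to the double-write file. */
        for (i = 0; i < n; i++)
            if (pwrite(dw_fd, batch[i].page, BLCKSZ,
                       (off_t) i * BLCKSZ) != BLCKSZ)
                return -1;

        /* 2. Make those copies durable before touching any data file, so
         * a torn data-file write can always be repaired from here. */
        if (fsync(dw_fd) != 0)
            return -1;

        /* 3. Write each page to its real location. */
        for (i = 0; i < n; i++)
            if (pwrite(batch[i].data_fd, batch[i].page, BLCKSZ,
                       batch[i].offset) != BLCKSZ)
                return -1;

        /* 4. fsync each distinct data file; the naive dedup here assumes
         * entries for the same file are adjacent, as after sorting. */
        for (i = 0; i < n; i++)
        {
            if (i > 0 && batch[i].data_fd == batch[i - 1].data_fd)
                continue;
            if (fsync(batch[i].data_fd) != 0)
                return -1;
        }
        return 0;
    }

The ordering is the whole point: the double-write file has to hit stable
storage before the first data-file write starts, or the torn-page
protection evaporates.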

There's a fundamental struggle here between latency and throughput. The longer you delay between writes and their subsequent sync, the more chance the OS gets to reorder and combine them for better throughput. Ditto for any storage-level optimizations: controller write caches and the like. All of that increases throughput, and more batching pushes further in that direction. But once you overload those caches and writes no longer squeeze into them, you get a latency spike. And as throughput increases, so does the amount of dirty cache that has to be cleared per unit of time.

Eventually, all this disk I/O turns into a series of random writes. You can postpone those in various ways, and resequence them in ways that help some tests. But if they're the true bottleneck, eventually every cache will fill, and clients will be stuck waiting on them. It's hard to imagine how anything that increases the total amount of data written could ever move that problem in the right direction for the worst case. Adjusting the sync sequence just moves the problem somewhere else. If you get lucky, that's a better place most of the time; how that bet turns out is very workload-dependent, though. I've lost a lot of those bets trying to resequence syncs over the last two years, where the benefits were extremely test-dependent.

> We have some other code (not included) that sorts buffers to be
> checkpointed in file/block order -- this can reduce fsync overhead
> further by ensuring that each batch writes to only one or a few data
> files.
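
Presumably the sort looks something like this; the struct and field names
are my guesses for illustration, since that code wasn't included:

    #include <stdlib.h>

    typedef struct
    {
        unsigned int rel_node;  /* which relation file */
        unsigned int fork;      /* which fork of that relation */
        unsigned int block;     /* block number within the file */
    } DirtyTag;

    static int
    tag_cmp(const void *a, const void *b)
    {
        const DirtyTag *x = (const DirtyTag *) a;
        const DirtyTag *y = (const DirtyTag *) b;

        if (x->rel_node != y->rel_node)
            return x->rel_node < y->rel_node ? -1 : 1;
        if (x->fork != y->fork)
            return x->fork < y->fork ? -1 : 1;
        if (x->block != y->block)
            return x->block < y->block ? -1 : 1;
        return 0;
    }

    /* Sort so each batch-sized slice touches as few files as possible. */
    static void
    sort_dirty_tags(DirtyTag *tags, size_t n)
    {
        qsort(tags, n, sizeof(DirtyTag), tag_cmp);
    }

Sorting this way also makes the fsync deduplication in the earlier sketch
actually fire, since pages from the same file end up adjacent in the batch.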

Again, the database doesn't necessarily have the information to make this class of decision better than the underlying layers do. We've already been through two runs at this idea, both of which ended inconclusively. The one I did last year is at http://highperfpostgres.com/pgbench-results/index.htm ; sets 9 and 11 are the same test without (9) and with (11) write sorting. If there's really a difference there, it's below the noise floor as far as I can see. Whether sorting helps or hurts is both workload- and hardware-dependent.

> As Jignesh has mentioned on this list, we see significant performance
> gains when enabling double writes and disabling full_page_writes for
> OLTP runs with a sufficient buffer cache size.  We are now trying to
> measure some runs where the dirty buffer eviction rate by the backends
> is high.

To go at this usefully, we'd need positive results published along with a publicly reproducible benchmark. I aimed for a much smaller goal in a similar area around this same time last year, and didn't get very far down that path before 9.1 development closed; it just takes too long to run enough benchmarks to really validate performance code in the write path. This is a pretty intrusive change to drop into the codebase for 9.2 at this point in the development cycle.

P.S. I got the impression you're testing these changes primarily against a modified 9.0. One of the things that came out of the 9.1 performance testing was the "compact fsync queue" change. That improvement rippled out far enough that several things that used to matter in my tests stopped mattering once it was committed. If your baseline doesn't already include that feature, you may have an uphill battle proving that the performance gains you've been seeing still hold in the current 9.2 code, where performance has moved forward in ways 9.0 can't emulate.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
