On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
> Recently, users who consider system availability important use synchronous replication clusters.
If your argument for why it's OK to ignore bounding crash recovery on the master is that it's possible to fail over to a standby, I don't think that is acceptable. PostgreSQL users certainly won't like it.
> I especially want you to read lines 631, 651, and 656. MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MB).
You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm to realize that everything you're telling me about the writeback code and its congestion logic, I knew back in 2007. The situation is even worse than you describe, because this section of Linux has gone through multiple major revisions since then. You can't just say "here is the writeback source code"; you have to reference each of the commonly deployed versions of the writeback feature to tell how this is going to play out if released. There are four major ones I pay attention to: the old kernel style as seen in RHEL5/2.6.18--that's what my 2007 paper discussed--the similar code but with very different defaults in 2.6.22, the writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then the newer kernels. (The newer ones separate out into a few branches too; I haven't mapped those as carefully yet.)
If you tried to model your feature on Linux's approach here, what that means is that the odds of an ugly feedback loop here are even higher. You're increasing the feedback on what's already a bad situation that triggers trouble for people in the field. When Linux's congestion logic causes checkpoint I/O spikes to get worse than they otherwise might be, people panic because it seems like they stopped altogether. There are some examples of what really bad checkpoints look like in http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf if you want to see some of them. That's the talk I did around the same time I was trying out spreading the database fsync calls out over a longer period.
When I did that, checkpoints became even less predictable, and that was a major reason why I rejected the approach. I think your suggestion will have the same problem. You just aren't generating test cases with really large write workloads yet to see it. You also don't yet seem concerned about why exceeding the checkpoint timeout is a very bad thing.
> In addition, you have said that performance improves if you set a large checkpoint_timeout or checkpoint_completion_target, but is that true in all cases?
The timeout, yes. Throughput is always improved by increasing checkpoint_timeout. Fewer checkpoints per unit of time increases efficiency: fewer writes of the most heavily accessed buffers happen per transaction. It is faster because you are doing less work, which on average is always faster than doing more work. And doing less work usually beats doing more work, even when the more work is done smarter.
If you want to see how much work per transaction a test is doing, track the number of buffers written at the beginning and end of your test via pg_stat_bgwriter. Tests that delay checkpoints will show a lower total number of writes per transaction. That seems more efficient, but it's efficiency mainly gained by ignoring checkpoint_timeout.
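The before/after snapshot comparison can be sketched roughly like this. The column names are real pg_stat_bgwriter columns, but the snapshot values and transaction count below are invented for illustration; in practice you would SELECT the columns from pg_stat_bgwriter and take the transaction count from your benchmark output:

```python
# Sketch: compare pg_stat_bgwriter snapshots taken before and after a
# benchmark run to get buffers written per transaction. All numbers here
# are made up for illustration.

def writes_per_transaction(before, after, transactions):
    """Total buffers written during the test, divided by transactions."""
    cols = ("buffers_checkpoint", "buffers_clean", "buffers_backend")
    total_writes = sum(after[c] - before[c] for c in cols)
    return total_writes / transactions

before = {"buffers_checkpoint": 120_000, "buffers_clean": 30_000,
          "buffers_backend": 15_000}
after = {"buffers_checkpoint": 470_000, "buffers_clean": 95_000,
         "buffers_backend": 40_000}

print(writes_per_transaction(before, after, transactions=200_000))
```

A run that delays checkpoints will show this ratio dropping, mostly in the buffers_checkpoint component, which is exactly the effect described above.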
> When checkpoint_completion_target is actually enlarged, performance may fall in some cases. I think this is because the last fsync becomes heavy owing to writing slowly.
I think you're confusing throughput and latency here. Increasing the checkpoint timeout, or to a lesser extent the completion target, increases throughput on average. It results in less work, and the amount of work matters much more than scheduler details. No matter how efficient a given write is, whether you've sorted it across elevator horizon boundary A or boundary B, it's better not to do it at all.
But having fewer checkpoints sometimes makes latency worse too. Whether latency or throughput is the more important thing is a very complicated question. Having checkpoint_completion_target as the knob to control the latency/throughput trade-off hasn't worked out very well. No one has done a really comprehensive look at this trade-off since 8.3 development. I got halfway through it for 9.1; then we figured out that the fsync queue filling was actually responsible for most of my result variation, and Robert fixed that. It was a big enough change that I had to throw out all my earlier data as no longer relevant.
By the way: if you have a theory like "the last fsync having become heavy" for why something is happening, measure it. Set log_min_messages to debug2 and you'll get details about every single fsync in your logs. I did that for all my tests that led me to conclude fsync delaying on its own didn't help that problem. I was measuring my theories as directly as possible.
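Once those per-fsync details are in the logs, pulling the timings back out is a small scripting job. A minimal sketch follows; the log line format is modeled on the DEBUG-level "checkpoint sync" messages, but the exact wording varies by PostgreSQL version, so treat the sample lines and regex as illustrative and adjust them to match your own logs:

```python
import re

# Sketch: extract per-file fsync times from a PostgreSQL log. The line
# format below is an assumption modeled on the DEBUG "checkpoint sync"
# messages; check your server's actual output and adapt the regex.
SYNC_RE = re.compile(r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+) msec")

sample_log = """\
DEBUG:  checkpoint sync: number=1 file=base/16384/16397 time=3.512 msec
DEBUG:  checkpoint sync: number=2 file=base/16384/16399 time=451.913 msec
DEBUG:  checkpoint sync: number=3 file=base/16384/16401 time=12.004 msec
"""

times = [float(m.group(3)) for m in SYNC_RE.finditer(sample_log)]
print(f"fsyncs={len(times)} max={max(times):.1f}ms avg={sum(times)/len(times):.1f}ms")
```

A single slow outlier like the second line is what a "last fsync became heavy" theory predicts, so the max versus average comparison is the first thing to look at.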
> I would like to get from you an itemized list of what could serve as proof of my patch, because the DBT-2 benchmark takes a lot of time: about 3 - 4 hours per test setting.
That's great, but to add some perspective here I have spent over 1 year of my life running tests like this. The development cycle to do something useful in this area is normally measured in months of machine time running benchmarks, not hours or days. You're doing well so far, but you're just getting started.
My itemized list is simple: throw out all results where the checkpoint end goes more than 5% beyond its targets. When that happens, no matter what you think is causing your gain, I will assume it's actually less total writes that are improving things.
I'm willing to consider an optional, sloppy checkpoint approach that uses heavy load to adjust how often checkpoints happen. But if we're going to do that, it has to be extremely clear that the reason for the gain is the checkpoint spacing--and there is going to be a crash recovery time penalty paid for it. And this patch is not how I would do that.
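To make the trade-off concrete, here is one way such a load-adaptive spacing policy could look. This is purely a sketch of the idea, not proposed code, and every threshold in it is invented:

```python
# Sketch of a load-adaptive ("sloppy") checkpoint interval: under heavy
# write load, stretch the spacing between checkpoints to gain throughput,
# at the cost of more WAL to replay after a crash. All constants invented.

def next_checkpoint_interval(base_s, write_mb_per_s, max_stretch=2.0):
    """Stretch the base interval linearly with write load, capped."""
    stretch = min(max_stretch, 1.0 + write_mb_per_s / 100.0)
    return base_s * stretch

print(next_checkpoint_interval(300, 10))   # light load: modest stretch
print(next_checkpoint_interval(300, 250))  # heavy load: capped at 2x
```

The cap is where the crash recovery penalty becomes explicit: a 2x stretch means up to twice the WAL to replay, which is exactly the cost that would have to be documented clearly for users choosing such a mode.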
It's not really clear yet where the gains you're seeing are really coming from. If you re-run all your tests with pg_stat_bgwriter before/after snapshots, log every fsync call, and then build some tools to analyze the fsync call latency, you'll have enough data to talk about this usefully. That's what I consider the bare minimum evidence to consider changing something here. I have all of those features in pgbench-tools with checkpoint logging turned way up, but they're not all in the dbt2 toolset yet as far as I know.
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com