On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
> Recently, users who consider system availability important use synchronous replication clusters.
If your argument for why it's OK to ignore bounding crash recovery on the master is that it's possible to fail over to a standby, I don't think that is acceptable. PostgreSQL users certainly won't like it.
> I especially want you to read lines 631, 651, and 656. MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MB).
You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm to realize that everything you're telling me about the writeback code and its congestion logic, I knew back in 2007. The situation is even worse than you describe, because this section of Linux has gone through multiple major revisions since then. You can't just say "here is the writeback source code"; you have to reference each of the commonly deployed versions of the writeback feature to tell how this is going to play out if released. There are four major ones I pay attention to: the old kernel style as seen in RHEL5/2.6.18--that's what my 2007 paper discussed--the similar code but with very different defaults in 2.6.22, the writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then the newer kernels. (The newer ones separate out into a few branches too; I haven't mapped those as carefully yet.)
If you tried to model your feature on Linux's approach here, what that means is that the odds of an ugly feedback loop here are even higher. You're increasing the feedback on what's already a bad situation that triggers trouble for people in the field. When Linux's congestion logic causes checkpoint I/O spikes to get worse than they otherwise might be, people panic because it seems like they stopped altogether. There are some examples of what really bad checkpoints look like in http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf if you want to see some of them. That's the talk I did around the same time I was trying out spreading the database fsync calls out over a longer period.
When I did that, checkpoints became even less predictable, and that was a major reason why I rejected the approach. I think your suggestion will have the same problem. You just aren't generating test cases with really large write workloads yet to see it. You also don't yet seem concerned about why exceeding the checkpoint timeout is a very bad thing.
> In addition, you have said that performance improves if you set a large checkpoint_timeout or checkpoint_completion_target, but is that true in all cases?
The timeout, yes. Throughput is always improved by increasing checkpoint_timeout. Fewer checkpoints per unit of time increases efficiency: fewer writes of the most heavily accessed buffers happen per transaction. It is faster because you are doing less work, which on average is always faster than doing more work. And doing less work usually beats doing more work, even when the more work is done smarter.
If you want to see how much work per transaction a test is doing, track the number of buffers written at the beginning and end of your test via pg_stat_bgwriter. Tests that delay checkpoints will show a lower total number of writes per transaction. That seems more efficient, but it's efficiency mainly gained by ignoring checkpoint_timeout.
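The before/after snapshot comparison can be sketched roughly like this. The column names are real pg_stat_bgwriter columns, but the snapshot values and transaction count below are invented for illustration; in practice you would SELECT the columns from pg_stat_bgwriter and take the transaction count from your benchmark output:

```python
# Sketch: compare pg_stat_bgwriter snapshots taken before and after a
# benchmark run to get buffers written per transaction. All numbers here
# are made up for illustration.

def writes_per_transaction(before, after, transactions):
    """Total buffers written during the test, divided by transactions."""
    cols = ("buffers_checkpoint", "buffers_clean", "buffers_backend")
    total_writes = sum(after[c] - before[c] for c in cols)
    return total_writes / transactions

before = {"buffers_checkpoint": 120_000, "buffers_clean": 30_000,
          "buffers_backend": 15_000}
after = {"buffers_checkpoint": 470_000, "buffers_clean": 95_000,
         "buffers_backend": 40_000}

print(writes_per_transaction(before, after, transactions=200_000))
```

A run that delays checkpoints will show this ratio dropping, mostly in the buffers_checkpoint component, which is exactly the effect described above.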
> When checkpoint_completion_target is actually enlarged, performance may fall in some cases. I think this is because the last fsync becomes heavy owing to writing slowly.
I think you're confusing throughput and latency here. Increasing the checkpoint timeout, or to a lesser extent the completion target, increases throughput on average. It results in less work, and the amount of work matters much more than scheduler details. No matter how efficient a given write is, whether you've sorted it across elevator horizon boundary A or boundary B, it's better not to do it at all.
But having fewer checkpoints sometimes makes latency worse too. Whether latency or throughput is the more important thing is a very complicated question. Having checkpoint_completion_target as the knob to control the latency/throughput trade-off hasn't worked out very well. No one has done a really comprehensive look at this trade-off since 8.3 development. I got halfway through it for 9.1; then we figured out that the fsync queue filling was actually responsible for most of my result variation, and Robert fixed that. It was a big enough change that I had to throw out all my earlier data as no longer relevant.
By the way: if you have a theory like "the last fsync having become heavy" for why something is happening, measure it. Set log_min_messages to debug2 and you'll get details about every single fsync in your logs. I did that for all my tests that led me to conclude fsync delaying on its own didn't help that problem. I was measuring my theories as directly as possible.
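Once those per-fsync details are in the logs, pulling the timings back out is a small scripting job. A minimal sketch follows; the log line format is modeled on the DEBUG-level "checkpoint sync" messages, but the exact wording varies by PostgreSQL version, so treat the sample lines and regex as illustrative and adjust them to match your own logs:

```python
import re

# Sketch: extract per-file fsync times from a PostgreSQL log. The line
# format below is an assumption modeled on the DEBUG "checkpoint sync"
# messages; check your server's actual output and adapt the regex.
SYNC_RE = re.compile(r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+) msec")

sample_log = """\
DEBUG:  checkpoint sync: number=1 file=base/16384/16397 time=3.512 msec
DEBUG:  checkpoint sync: number=2 file=base/16384/16399 time=451.913 msec
DEBUG:  checkpoint sync: number=3 file=base/16384/16401 time=12.004 msec
"""

times = [float(m.group(3)) for m in SYNC_RE.finditer(sample_log)]
print(f"fsyncs={len(times)} max={max(times):.1f}ms avg={sum(times)/len(times):.1f}ms")
```

A single slow outlier like the second line is what a "last fsync became heavy" theory predicts, so the max versus average comparison is the first thing to look at.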
> I would like to get from you an itemized list of what could serve as proof of my patch, because the DBT-2 benchmark takes a lot of time: about 3 - 4 hours per test setting.
That's great, but to add some perspective here I have spent over 1 year of my life running tests like this. The development cycle to do something useful in this area is normally measured in months of machine time running benchmarks, not hours or days. You're doing well so far, but you're just getting started.
My itemized list is simple: throw out all results where the checkpoint end goes more than 5% beyond its targets. When that happens, no matter what you think is causing your gain, I will assume it's actually less total writes that are improving things.
I'm willing to consider an optional, sloppy checkpoint approach that uses heavy load to adjust how often checkpoints happen. But if we're going to do that, it has to be extremely clear that the reason for the gain is the checkpoint spacing--and there is going to be a crash recovery time penalty paid for it. And this patch is not how I would do that.
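To make the trade-off concrete, here is one way such a load-adaptive spacing policy could look. This is purely a sketch of the idea, not proposed code, and every threshold in it is invented:

```python
# Sketch of a load-adaptive ("sloppy") checkpoint interval: under heavy
# write load, stretch the spacing between checkpoints to gain throughput,
# at the cost of more WAL to replay after a crash. All constants invented.

def next_checkpoint_interval(base_s, write_mb_per_s, max_stretch=2.0):
    """Stretch the base interval linearly with write load, capped."""
    stretch = min(max_stretch, 1.0 + write_mb_per_s / 100.0)
    return base_s * stretch

print(next_checkpoint_interval(300, 10))   # light load: modest stretch
print(next_checkpoint_interval(300, 250))  # heavy load: capped at 2x
```

The cap is where the crash recovery penalty becomes explicit: a 2x stretch means up to twice the WAL to replay, which is exactly the cost that would have to be documented clearly for users choosing such a mode.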
It's not really clear yet where the gains you're seeing are really coming from. If you re-run all your tests with pg_stat_bgwriter before/after snapshots, log every fsync call, and then build some tools to analyze the fsync call latency, you'll have enough data to talk about this usefully. That's what I consider the bare minimum evidence to consider changing something here. I have all of those features in pgbench-tools with checkpoint logging turned way up, but they're not all in the dbt2 toolset yet as far as I know.
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com