Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

Greg Smith Mon, 22 Jul 2013 19:54:03 -0700

On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:

> The writeback source code which I indicated part of writeback is almost
> same as community kernel (2.6.32.61). I also read linux kernel 3.9.7,
> but it is almost same this part.

The main source code difference comes from going back to the RedHat 5 kernel, which means 2.6.18. For many of these versions, you are right that it is only the tuning parameters that were changed in newer versions.

Optimizing performance for the old RHEL5 kernel isn't the most important thing, but it's helpful to know the things it does very badly.

> My fsync patch is only sleep returned success of fsync and maximum sleep
> time is set to 10 seconds. It does not cause bad for this problem.

It's easy to have hundreds of relations that are getting fsync calls during a checkpoint. If you have 100 relations getting a 10 second sleep each, you could potentially delay checkpoints by 17 minutes this way. I regularly see systems where shared_buffers=8GB and there are 200 to 400 relation segments that need a sync during a checkpoint.
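
As a rough sketch of that arithmetic: the function name below is made up for illustration, and it assumes the worst case where every synced segment is followed by the full 10 second sleep. Real checkpoints would usually sleep less than this upper bound.

```python
def worst_case_sleep_seconds(relation_segments: int, max_sleep_seconds: float) -> float:
    """Upper bound on added checkpoint time if every fsync gets the maximum sleep.

    Hypothetical helper for illustration only; it is not part of the patch.
    """
    return relation_segments * max_sleep_seconds


# 100 is the example above; 200 and 400 are the segment counts seen in practice.
for segments in (100, 200, 400):
    delay = worst_case_sleep_seconds(segments, 10.0)
    print(f"{segments} segments: up to {delay / 60:.1f} minutes of added sleep")
# 100 segments: up to 16.7 minutes of added sleep
# 200 segments: up to 33.3 minutes of added sleep
# 400 segments: up to 66.7 minutes of added sleep
```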

This is the biggest problem with your submission. Once you give up following the checkpoint schedule carefully, it is very easy to end up with large checkpoint deadline misses on production servers. If someone thinks they are doing a checkpoint every 5 minutes, but your patch makes them take 20 minutes instead, that is bad. They will not expect that a crash might have to replay that much activity before the server is useful again.

>> You also don't seem afraid of how exceeding the
>> checkpoint timeout is a very bad thing yet.

> I think it is important that why this problem was caused. We should try
> to find the cause of which program has bug or problem.

The checkpointer process is the problem. There's no filesystem bug or complicated issues involved in many of the bad cases. Here is a simple example that shows how the toughest problem cases happen:


-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-3072 seconds to clear the cache = 51 minutes
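
The same estimate as a small sketch. It uses the rounded 6GB figure from the list rather than the exact 10% of 64GB, and the function name is only illustrative. The 2MB/s rate is the assumption that drives the whole result.

```python
def seconds_to_drain(dirty_mb: float, write_mb_per_s: float) -> float:
    """Time to write out a backlog of dirty pages at a fixed random-I/O rate."""
    return dirty_mb / write_mb_per_s


dirty_mb = 6 * 1024        # 6GB of dirty writes, the rounded figure from the list
write_mb_per_s = 2.0       # random I/O rate under heavy concurrent reads

seconds = seconds_to_drain(dirty_mb, write_mb_per_s)
print(f"{dirty_mb} MB at {write_mb_per_s:.0f} MB/s: {seconds:.0f} s = {seconds / 60:.1f} minutes")
# 6144 MB at 2 MB/s: 3072 s = 51.2 minutes
```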

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 s, total=4858.859 s
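
To see how lopsided that checkpoint is, here is a minimal sketch that pulls the three timings out of a log line like the one above and reports how much of the total went to the sync phase. The function name and the regular expression are illustrative assumptions, not anything from PostgreSQL itself.

```python
import re

# Matches the write/sync/total timings; tolerates a missing space before "s".
TIMING_RE = re.compile(
    r"write=([\d.]+)\s*s, sync=([\d.]+)\s*s, total=([\d.]+)\s*s"
)


def parse_checkpoint_timings(line: str) -> dict:
    """Return the write, sync and total times (in seconds) from a checkpoint log line."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError("no checkpoint timings found in line")
    write, sync, total = (float(value) for value in match.groups())
    return {"write": write, "sync": sync, "total": total}


log_line = (
    "LOG: checkpoint complete: wrote 33282 buffers (3.2%); "
    "0 transaction log file(s) added, 60 removed, 129 recycled; "
    "write=228.848 s, sync=4628.879 s, total=4858.859 s"
)

timings = parse_checkpoint_timings(log_line)
sync_share = timings["sync"] / timings["total"]
print(f"sync phase: {timings['sync'] / 60:.1f} minutes, {sync_share:.1%} of the checkpoint")
```

Nearly all of the elapsed time is in the sync phase, which is the part an added per-file sleep would stretch further.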

It's very hard to do better on these, and I don't expect any change to help this a lot. But I don't want to see a change committed that makes this sort of checkpoint 17 minutes longer if there's 100 relations involved either.

> My patch not only improvement of throughput but also
> realize stable response time at fsync phase in checkpoint.

The main reason your patch improves latency and throughput is that it makes checkpoints farther apart. That's why I drew you a graph showing how the time between checkpoints lined up perfectly with TPS. If it was only a small problem it would be worth considering, but I think it's likely to end up with these >15 minute delays I've outlined here instead.

> And I survey about ext3 file system.

I wouldn't worry too much about the problems ext3 has. Like the old RHEL5 kernel I was commenting about above, there are a lot of ext3 systems out there. But we can't do a lot about getting good performance from them. It's only important to test that you're not making them a lot worse with a change.

> My system block size is 4096, but
> 8192 or more seems to better. It will decrease number of inode and get
> more large sequential disk fields.

I normally increase read-ahead on Linux systems to get faster speed on sequential disk throughput. Changing the block size might work better in some cases, but not many people are willing to do that. Read-ahead is very easy to change at any time.


--
Greg Smith   2ndQuadrant US   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


