On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> If the kernel can treat sequential writes better than random writes, is it worth sorting dirty buffers in block order per file at the start of checkpoints?
I think it has the potential to improve things. There are three obvious and one subtle argument against it I can think of:
1) Extra complexity for something that may not help. This would need good, robust benchmark results showing an improvement to justify its use.
2) Block number ordering may not reflect actual order on disk. While true, it's got to be better correlated with it than writing at random.
3) The OS disk elevator should be dealing with this issue, particularly because it may really know the actual disk ordering.
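To make the proposal concrete, here is a minimal sketch in Python of what "sorting dirty buffers in block order per file" could mean. DirtyBuffer, file_id, block_num and checkpoint_write_order are hypothetical names chosen for illustration; they are not PostgreSQL structures or functions.

```python
from collections import namedtuple

# Hypothetical stand-in for a dirty shared buffer: the file it belongs to
# and the block within that file that it holds.
DirtyBuffer = namedtuple("DirtyBuffer", ["file_id", "block_num"])


def checkpoint_write_order(dirty_buffers):
    """Return the buffers grouped by file, ascending block number within each file."""
    return sorted(dirty_buffers, key=lambda b: (b.file_id, b.block_num))


buffers = [
    DirtyBuffer(2, 40),
    DirtyBuffer(1, 7),
    DirtyBuffer(2, 3),
    DirtyBuffer(1, 2),
]
ordered = checkpoint_write_order(buffers)
for buf in ordered:
    print(buf.file_id, buf.block_num)
```

Whether this ordering matches the layout on disk is exactly the open question in point 2; the sketch only shows the ordering itself.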
Here's the subtle thing: by writing in the same order the LRU scan occurs in, you are writing dirty buffers in the optimal fashion to eliminate client backend writes during BufferAlloc. This makes the checkpoint a really effective LRU clearing mechanism. Writing in block order will change that.
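A toy model may help show why the write order matters here. It is only a sketch under simplifying assumptions: every buffer in the pool is dirty, each buffer is identified by the block number it holds, and the allocation sweep visits buffers in a fixed order. The function name and the numbers are made up for illustration.

```python
def dirty_ahead_of_sweep(write_order, scan_order, written, lookahead):
    """Count buffers still dirty among the next `lookahead` scan positions
    once the checkpoint has written the first `written` entries of write_order."""
    cleaned = set(write_order[:written])
    return sum(1 for buf in scan_order[:lookahead] if buf not in cleaned)


# Order in which the allocation sweep will reach the buffers.
scan_order = [50, 3, 27, 11, 42, 8, 19, 35]
# Order a block-sorted checkpoint would write the same buffers in.
block_order = sorted(scan_order)

in_scan = dirty_ahead_of_sweep(scan_order, scan_order, written=4, lookahead=4)
in_block = dirty_ahead_of_sweep(block_order, scan_order, written=4, lookahead=4)
print(in_scan, in_block)  # prints "0 2"
```

After four checkpoint writes in scan order, the next four buffers a backend would reach are all clean. After four writes in block order, two of them are still dirty, so in this model a backend needing one of those buffers would have to write it out itself.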
I spent some time trying to optimize the elevator part of this operation, since I knew that on the system I was using block order was actual order. I found that under Linux, the behavior of the pdflush daemon that manages dirty memory had a more serious impact on writing behavior at checkpoint time than playing with the elevator scheduling method did. The way pdflush works actually has several interesting implications for how to optimize this patch. For example, how writes get blocked when the dirty memory reaches certain thresholds means that you may not get the full benefit of the disk elevator at checkpoint time the way most would expect.
Since much of that was basically undocumented, I had to write my own analysis of the actual workings, which is now available at http://www.westnet.com/~gsmith/content/linux-pdflush.htm
I hope that anyone who wants more information about how Linux kernel parameters like dirty_background_ratio actually work, and how they affect the writing strategy, will find that article helpful.
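For anyone who wants to see what their own system is set to, a minimal sketch in Python follows. It reads the tunables from /proc/sys/vm, where Linux exposes them. dirty_ratio is not mentioned above but is the companion setting to dirty_background_ratio. The helper names and the 8 GB figure are hypothetical, and the byte estimate is only rough, because the kernel applies the percentage to its own notion of usable memory rather than to total RAM.

```python
import os


def read_vm_setting(name, base="/proc/sys/vm"):
    """Return the integer value of a Linux vm tunable, or None if it cannot be read."""
    try:
        with open(os.path.join(base, name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def approx_threshold_bytes(ratio_percent, memory_bytes):
    """Rough size of dirty memory at which a percentage threshold is reached."""
    return memory_bytes * ratio_percent // 100


memory_bytes = 8 * 1024**3  # assumed 8 GB of RAM, purely for illustration
for name in ("dirty_background_ratio", "dirty_ratio"):
    value = read_vm_setting(name)
    if value is None:
        print(name, "not available on this system")
    else:
        print(name, value, "-> roughly", approx_threshold_bytes(value, memory_bytes), "bytes")
```

On systems without /proc/sys/vm the sketch simply reports that the setting is unavailable.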
> Some kernels or storage subsystems treat all I/Os too fairly so that user transactions waiting for reads are blocked by checkpoint writes.
In addition to that (which I've seen happen quite a bit), in the Linux case another fairness issue is that the code that handles writes allows a single process writing a lot of data to block writes for everyone else. That means that in addition to being blocked on actual reads, if a client backend starts a write in order to complete a buffer allocation to hold new information, that can grind to a halt because of the checkpoint process as well.
-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD