Hi,

Here is what I've discovered so far from looking at the blktrace info I captured for 5 minutes of the untar process:

1. Log flush

This generates lots of 4k writes. These are at least contiguous and are likely to be the source of most of the merges that we see in the stats. So not a big issue overall, since they are getting merged up and the log flushes themselves seem to take only a fairly short time. Even a long one doesn't seem to last more than 1/10th of a second. There is already a patch upstream which will result in much larger log flush i/o requests.

2. AIL flush

There were only a couple of these during the whole 5 minute period, and they didn't take much time. At some stage we should look at trying to sort the i/o generated by AIL flushes, but this is tricky until we can reorganise the log code a bit.

3. Data writeback

This is now (with the extra patch Bob included) getting merged up nicely and creating large i/os. We see these as part of log flush (when flushing ordered data) and also when writes from tar are being rate limited. All in all it looks ok.

4. Glock shrinking

This generates i/o when the glocks are demoted as part of the shrinking process. There are a number of bursts of this i/o, and it looks like it is at least mostly in the right order. That is most likely due to filling the fs in order and thus also placing the glocks on the lru list in the same order, leading to reaping them in order too. There is an upstream patch that will help to ensure correct ordering here, however it doesn't look like this is an issue.

5. Metadata writeback

This looks like the troublesome one. It is generating a lot of single block, rather random looking i/o requests. These appear to be triggered from the flusher calling writepage on the metadata. Since we use separate metadata address spaces for each inode's metadata, this is going to be a tricky one to solve. Potentially we might be able to just skip some of the writeback, since I suspect that we have WB_SYNC_NONE in that case. So we need to look for some solutions to this, I think.

... and that's about it.

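
As a rough way of telling the contiguous pattern in #1 apart from the scattered single block pattern in #5, something like the sketch below could be run over a list of write requests. The function name and the sample data are made up for illustration and are not taken from the trace. It assumes each request is a (start sector, length in sectors) pair and that sectors are 512 bytes, so a 4k write is 8 sectors.

```python
def contiguous_fraction(requests):
    """Return the fraction of requests that start exactly where the
    previous request in the list ended.

    requests: list of (start_sector, n_sectors) tuples, in issue order.
    """
    if len(requests) < 2:
        return 0.0
    hits = 0
    for (prev_start, prev_len), (start, _length) in zip(requests, requests[1:]):
        if start == prev_start + prev_len:
            hits += 1
    return hits / (len(requests) - 1)


# Hypothetical sample data, NOT from the captured trace.
# Six back-to-back 4k writes, as a log flush might issue them.
log_like = [(1000 + 8 * i, 8) for i in range(6)]
# Six single block writes at unrelated offsets.
metadata_like = [(5000, 8), (120, 8), (9040, 8), (128, 8), (77000, 8), (3000, 8)]

print(contiguous_fraction(log_like))
print(contiguous_fraction(metadata_like))
```

A value near 1 suggests requests that the block layer has a good chance of merging, and a value near 0 suggests the random looking pattern. This only compares each request with the one issued immediately before it, so it is a crude measure and says nothing about what an elevator could do by sorting.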
I can't spot any other generators of i/o beyond the odd block or two, so I suspect that #5 is the main thing we should be looking at. Having said that, the performance criteria for the US Army to use GFS2 are not going to be affected by this, so it is something that we can investigate in due course, but we don't have to do it immediately.

Steve.
