Re: [Cluster-devel] [GFS2 Patch] [Try 2] GFS2: Reduce file fragmentation

Bob Peterson Wed, 11 Jul 2012 11:52:10 -0700

----- Original Message -----
| Hi,
| 
| On Wed, 2012-07-11 at 14:07 -0400, Bob Peterson wrote:
| [snip]
| >  
| > | What is the difference between rs_free and rs_blks? Shouldn't
| > | these
| > | two
| > | always be identical, since there is no point in reserving blocks
| > | which
| > | are not free.
| > 
| > I guess I used a misleading variable name. The two variables have
| > two meanings and both are needed. I renamed variable rs_blks to
| > rs_len
| > because it represents the length of the reservation.
| > 
| That's not really answering the question though... all the blocks in
| the
| reservation must be free, otherwise there is no point in reserving
| them.
| So rs_free should be identical to rs_len or whatever it is called.
| Either that or maybe I'm not understanding why there are two
| different
| variables?


The variables and their meanings are as follows:

1. rs_start - this is where the block reservation originally started.
   This never changes during the life of the reservation.
2. rs_len - this is the length of the reservation, in blocks.
   This never changes during the life of the reservation.
3. rs_free - this is how many of those blocks are free.
   This is decremented every time a block is claimed from the reservation.

So the number of blocks "used" or "claimed" from the reservation is
rs_len - rs_free.
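
To make the accounting concrete, here is a minimal sketch of the three
fields. The struct name, the field widths and the helper are hypothetical
stand-ins for illustration, not the actual GFS2 definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the reservation described above;
 * the name and the field widths are assumptions. */
struct resv_sketch {
	uint64_t rs_start; /* first block of the reservation; never changes */
	uint32_t rs_len;   /* length in blocks; never changes */
	uint32_t rs_free;  /* blocks still unclaimed; counts down */
};

/* Number of blocks already claimed: rs_len - rs_free. */
static uint32_t resv_claimed(const struct resv_sketch *rs)
{
	assert(rs->rs_free <= rs->rs_len);
	return rs->rs_len - rs->rs_free;
}
```

For example, a 32-block reservation with rs_free == 25 has had 7 blocks
claimed from it.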

We could likely accomplish the same thing with only two variables,
by incrementing rs_start and decrementing rs_len each time a block is
claimed from the reservation, but I'm also using rs_start as a means of
keeping the reservations aligned in the bitmap.

I think keeping the reservations on u64 boundaries gives us the best
performance for function memchr_inv, which I think is optimized to use
word compares where it can.
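
As a sketch of what the alignment amounts to: assuming two bitmap bits
per block (so four blocks per byte, which is my understanding of the GFS2
bitmap layout), one u64 of bitmap covers 32 blocks. The helper names below
are hypothetical, and "off" is a block offset within the bitmap:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 bitmap bits per block, i.e. 4 blocks per bitmap byte. */
#define BLKS_PER_BYTE 4u
#define BLKS_PER_U64  (BLKS_PER_BYTE * (uint32_t)sizeof(uint64_t)) /* 32 */

/* Round a block offset down to the start of its u64 bitmap word. */
static uint32_t resv_align_down(uint32_t off)
{
	return off - (off % BLKS_PER_U64);
}

/* Round a block offset up to the next u64 bitmap word boundary. */
static uint32_t resv_align_up(uint32_t off)
{
	assert(off <= UINT32_MAX - (BLKS_PER_U64 - 1)); /* avoid wraparound */
	return resv_align_down(off + BLKS_PER_U64 - 1);
}
```

With these, offset 45 rounds down to 32 and up to 64, so a reservation
placed there would begin on a whole-word boundary of the bitmap.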

Doing it my way also makes it easier to read the trace points: You can
see where the reservation started, what's been claimed and what's free.
It's easier to detect problems with overlapping reservations and such.

| [snip]
| > | 
| > | I'm not sure that I understand this comment at all. Currently
| > | with
| > | directories we never deallocate any blocks at all until the
| > | directory
| > | is
| > | deallocated when it is unlinked. We will want to extend this to
| > | directories eventually, even if we don't do that immediately.
| > 
| > I clarified the comment to make it more clear what's going on.
| > I'm talking about gaps in the _reservation_ not gaps in the blocks.
| > The current algorithm makes assumptions based on the fact that
| > block
| > reservations don't have gaps, and the "next" free block will be the
| > successor to the last claimed. If you use reservations for
| > directories,
| > what can happen is that two files may be created, which claims two
| > blocks in the reservation. If the first file is deleted from the
| > directory, that block becomes a "hole" in the reservation, which
| > breaks
| > the code with its current assumptions. We either have to:
| > (a) keep the current assumptions which make block claims faster, or
| > (b) Make no such assumptions and implement a bitmap-like search of
| > the
| >     reservation that can fill holes. It wouldn't be too tough to
| >     do,
| >     especially since we already have nicely tuned functions to do
| >     it.
| >     I'm just worried that it's going to hurt performance.
| >  
| That is just a bug in the way we are doing the allocations. The
| allocation of new inodes should be done based on the inode's own
| reservation, and not on the reservation of its parent directory. That
| is
| something else on the "to fix" list, but it is complicated to do,

No, it goes beyond that. It has to do with the way the block accounting is
done for the reservations. If a big write is trying to write 7 blocks
(let's say in a multi-page write, which isn't implemented yet, but similar
things happen today) and rs_free says there are 7 free blocks in the
reservation, it claims them all starting with rs_start + (rs_len - rs_free).
If there are "holes" in the reservation, that would throw the whole thing
off. It would have to do a bitmap search for the reservation to figure out
where the first available block is. If there are 7 free blocks, but one of
them is a "hole" and six are contiguous, it has to do a bitmap search of the
reservation to find each of the 7.
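
The difference between the two approaches can be sketched as follows.
Everything here is hypothetical: the struct, the function names, and the
single 64-bit "in use" mask, which stands in for a real bitmap search and
limits the sketch to reservations of at most 64 blocks:

```c
#include <assert.h>
#include <stdint.h>

struct resv_sketch {
	uint64_t rs_start; /* first block; fixed */
	uint32_t rs_len;   /* length in blocks; fixed */
	uint32_t rs_free;  /* unclaimed blocks */
};

/* Fast path: valid only while the claimed blocks form a gap-free run
 * from rs_start. Claims n blocks and returns the first one. */
static uint64_t resv_claim_run(struct resv_sketch *rs, uint32_t n)
{
	uint64_t first;

	assert(rs->rs_free <= rs->rs_len && n > 0 && n <= rs->rs_free);
	first = rs->rs_start + (rs->rs_len - rs->rs_free);
	rs->rs_free -= n;
	return first;
}

/* Hole-tolerant path: scan for the first unused block. Bit i of *used
 * set means block rs_start + i is in use (assumes rs_len <= 64). */
static int resv_claim_one(struct resv_sketch *rs, uint64_t *used,
			  uint64_t *block)
{
	uint32_t i;

	assert(rs->rs_len <= 64);
	for (i = 0; i < rs->rs_len; i++) {
		if (!(*used & (UINT64_C(1) << i))) {
			*used |= UINT64_C(1) << i;
			rs->rs_free--;
			*block = rs->rs_start + i;
			return 0;
		}
	}
	return -1;
}
```

The fast path does one addition per claim no matter how many blocks are
requested, while the hole-tolerant path has to search once per block,
which is the performance cost I'm worried about.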

On the other hand, if we allow holes and adjust the algorithm
appropriately, I think the file system will end up being more fragmented
than it is with the current algorithm. The current algorithm is written
with the thought that files will have larger runs of data blocks and
metadata blocks can fill in the holes left behind.

The other approach that I talked about above (incrementing the starting
block and decrementing the size) would solve this problem, but the
deleted file would force a block to be "left behind" for a future
reservation to find, which would likely add to the fragmentation of the
file system. I could be wrong about that, and we could prototype it to
find out for sure. (IOW, it may not be any worse, since we're talking about
directories which bypass the reservations and do individual searches
anyway).

Regards,

Bob Peterson
Red Hat File Systems
