-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jeff Mahoney wrote:
> Hans Reiser wrote:
>>> Guys, if you run the kernel under a debugger, and get it to where you
>>> see the excessive CPU usage, and then start stepping through the bitmap
>>> code, I am sure it will be very obvious what the error is. Can anyone
>>> do that for us? Jeff?
>
> Apologies to everyone CC'd who've already seen this message. It was
> bounced from the namesys servers and I wanted to preserve the CC list.
>
> ***
>
> Mike sent me a copy of the metadata and I am now able to reproduce
> locally. My profiling looks like this:
>
> samples % image name app name symbol name
> 148596 17.8573 reiserfs.ko reiserfs reiserfs_in_journal
> 58194 6.9934 reiserfs.ko reiserfs search_by_key
> 38937 4.6792 vmlinux vmlinux memmove
> 38783 4.6607 reiserfs.ko reiserfs scan_bitmap_block
> 38466 4.6226 jbd jbd (no symbols)
> 23249 2.7939 vmlinux vmlinux __find_get_block
> 18196 2.1867 vmlinux vmlinux tty_write
> 17734 2.1312 vmlinux vmlinux do_ioctl
> 17293 2.0782 loop loop (no symbols)
> 15400 1.8507 vmlinux vmlinux cond_resched_lock
> 14836 1.7829 vmlinux vmlinux copy_user_generic_c
> 14143 1.6996 reiserfs.ko reiserfs do_journal_end
> 13638 1.6389 vmlinux vmlinux find_next_zero_bit
> 13236 1.5906 vmlinux vmlinux default_llseek
> 12925 1.5532 vmlinux vmlinux bit_waitqueue
> 8921 1.0721 vmlinux vmlinux __delay
>
>
> Hans -
>
> My speculation about the bitmaps being fragmented was right on. I wrote
> a quick little script to parse the output of debugreiserfs -m and report
> on the frequency of different window sizes. Windows of 1-31 blocks are
> extremely common, accounting for 99.8% of all free windows. The problem
> is that in my testing, where I made the allocator report the size of
> allocation requests, the most common request was for a window of 32 blocks.
>
> What's happening is that we keep finding windows that are too small,
> which results in a lot of wasted effort. The cycle goes like this:
>
> if (unfm && is_block_in_journal(s, bmap_n, *beg, beg))
> continue;
> /* first zero bit found; we check next bits */
> for (end = *beg + 1;; end++) {
> if (end >= *beg + max || end >= boundary
> || reiserfs_test_le_bit(end, bi->bh->b_data)) {
> next = end;
> break;
> }
> /* finding the other end of zero bit window requires
> * looking into journal structures (in
> * case of searching for free blocks for unformatted nodes) */
> if (unfm && is_block_in_journal(s, bmap_n, end, &next))
> break;
> }
>
> If the window is too small, we end up looping up to the top and try to
> find another one. Since the overwhelming majority of the windows are too
> small, we go through just about all the bitmaps without backing off the
> window size.
To be clear, eventually the allocations are honored, but only after
*all* of the bitmaps are searched. On the third pass, we drop the window
to a single block and restart the scan, eventually building a 32-block
set that is probably quite fragmented. This occurs on every write, hence
the huge performance hit.
It appears as though ext3 doesn't have this problem because they don't
batch writes the way reiserfs does. They'll start a search at a decent
hint the same way we do, but the window is always one block.
So, we're stuck between a rock and a hard place. We can have the better
allocation performance at lower usage and sacrifice performance later or
we can have stable allocation performance at an overall reduction in
performance.
I have an idea that may get around both problems, but I'm not sure how
well it would be received. We currently do some very basic caching of
bitmap metadata such as the first zero bit and how many free blocks
there are. What if we constructed an extent map of the free windows in
each bitmaps when we cache the metadata and adjust the map when we
There's a third option, but I'm not sure how well it would be received
Right now, the allocator keeps track of things like how full a bitmap is
and where the first zero bit is. It would also be possible to cache a
list of windows in each bitmap to accelerate performance. This would
have to be a shrinkable cache, since the pathlogical case could mean
occupying
- -Jeff
- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFEtEhVLPWxlyuTD7IRAnt8AJ4qnp+578/oqKbyLbXJJoFewfOuSwCcDJJN
izEeprRI0kSOmTZ860sVYOY=
=xUpP
-----END PGP SIGNATURE-----