-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Mahoney wrote:
> Hans Reiser wrote:
>>> Guys, if you run the kernel under a debugger, and get it to where you
>>> see the excessive CPU usage, and then start stepping through the bitmap
>>> code, I am sure it will be very obvious what the error is.  Can anyone
>>> do that for us?  Jeff?
> 
> Apologies to everyone CC'd who've already seen this message. It was
> bounced from the namesys servers and I wanted to preserve the CC list.
> 
> ***
> 
> Mike sent me a copy of the metadata and I am now able to reproduce
> locally. My profiling looks like this:
> 
> samples  %      image name      app name        symbol name
> 148596  17.8573 reiserfs.ko     reiserfs        reiserfs_in_journal
> 58194   6.9934  reiserfs.ko     reiserfs        search_by_key
> 38937   4.6792  vmlinux         vmlinux         memmove
> 38783   4.6607  reiserfs.ko     reiserfs        scan_bitmap_block
> 38466   4.6226  jbd             jbd             (no symbols)
> 23249   2.7939  vmlinux         vmlinux         __find_get_block
> 18196   2.1867  vmlinux         vmlinux         tty_write
> 17734   2.1312  vmlinux         vmlinux         do_ioctl
> 17293   2.0782  loop            loop            (no symbols)
> 15400   1.8507  vmlinux         vmlinux         cond_resched_lock
> 14836   1.7829  vmlinux         vmlinux         copy_user_generic_c
> 14143   1.6996  reiserfs.ko     reiserfs        do_journal_end
> 13638   1.6389  vmlinux         vmlinux         find_next_zero_bit
> 13236   1.5906  vmlinux         vmlinux         default_llseek
> 12925   1.5532  vmlinux         vmlinux         bit_waitqueue
> 8921    1.0721  vmlinux         vmlinux         __delay
> 
> 
> Hans -
> 
> My speculation about the bitmaps being fragmented was right on. I wrote
> a quick little script to parse the output of debugreiserfs -m and report
> on the frequency of different window sizes. Windows of 1-31 blocks are
> extremely common, accounting for 99.8% of all free windows. The problem
> is that in my testing, where I made the allocator report the size of
> allocation requests, the most common request was for a window of 32 blocks.
> 
> What's happening is that we keep finding windows that are too small,
> which results in a lot of wasted effort. The cycle goes like this:
> 
> if (unfm && is_block_in_journal(s, bmap_n, *beg, beg))
>         continue;
> /* first zero bit found; we check next bits */
> for (end = *beg + 1;; end++) {
>         if (end >= *beg + max || end >= boundary
>             || reiserfs_test_le_bit(end, bi->bh->b_data)) {
>                 next = end;
>                 break;
>         }
>         /* finding the other end of zero bit window requires
>          * looking into journal structures (in
>          * case of searching for free blocks for unformatted nodes) */
>         if (unfm && is_block_in_journal(s, bmap_n, end, &next))
>                 break;
> }
> 
> If the window is too small, we end up looping up to the top and try to
> find another one. Since the overwhelming majority of the windows are too
> small, we go through just about all the bitmaps without backing off the
> window size.

To be clear, eventually the allocations are honored, but only after
*all* of the bitmaps are searched. On the third pass, we drop the window
to a single block and restart the scan, eventually building a 32-block
set that is probably quite fragmented. This occurs on every write, hence
the huge performance hit.
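
To illustrate the cost, here is a much-simplified model of the scan loop quoted above (journal checks omitted, names invented): when every free window is smaller than the request, the scan examines essentially every bit in the bitmap and still comes up empty, while a one-block request, as in the fallback pass, succeeds after examining a single bit:

```python
def scan_for_window(bitmap, want):
    """Simplified model of scan_bitmap_block: look for a run of `want`
    zero bits; return (start, bits_examined), with start=None on failure."""
    examined = 0
    i = 0
    n = len(bitmap)
    while i < n:
        if bitmap[i] == 1:
            examined += 1
            i += 1
            continue
        # found a zero bit; measure the window, as the for-loop above does
        end = i
        while end < n and end < i + want and bitmap[end] == 0:
            examined += 1
            end += 1
        if end - i >= want:
            return i, examined
        # window too small: skip past it and keep scanning
        if end < n:
            examined += 1      # the set bit that terminated the window
        i = end + 1
    return None, examined

# 100 free windows of 16 blocks each, separated by single used blocks
bitmap = ([0] * 16 + [1]) * 100
print(scan_for_window(bitmap, 32))   # (None, 1700): full scan, no hit
print(scan_for_window(bitmap, 1))    # (0, 1): succeeds immediately
```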

It appears as though ext3 doesn't have this problem because they don't
batch writes the way reiserfs does. They'll start a search at a decent
hint the same way we do, but the window is always one block.

So, we're stuck between a rock and a hard place. We can keep the batched
32-block allocations, which perform well while the disk is lightly used
but degrade badly once the bitmaps fragment, or we can allocate a block
at a time, which is stable but uniformly slower.

I have an idea that may get around both problems, but I'm not sure how
well it would be received. Right now, the allocator caches some very
basic bitmap metadata, such as where the first zero bit is and how many
free blocks there are. What if we also constructed an extent map of the
free windows in each bitmap when we cache that metadata, and adjusted
the map as blocks are allocated and freed? Then we could satisfy (or
reject) a request without rescanning the bitmap at all. It would have
to be a shrinkable cache, though, since the pathological case of free
blocks scattered one per window could mean occupying a significant
amount of memory.
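
As a sketch of what such an extent map might look like (illustrative only; `WindowCache`, `alloc`, and `free` are invented names, and a real implementation would need locking and shrinker integration):

```python
import bisect

class WindowCache:
    """Hypothetical per-bitmap cache of free windows, kept as a list of
    (start, length) extents sorted by start."""
    def __init__(self, windows):
        self.windows = sorted(windows)

    def alloc(self, want):
        """Return the start of the first window of at least `want`
        blocks, trimming that extent, or None if nothing fits."""
        for idx, (start, length) in enumerate(self.windows):
            if length >= want:
                if length == want:
                    del self.windows[idx]
                else:
                    self.windows[idx] = (start + want, length - want)
                return start
        return None                # caller falls back to another bitmap

    def free(self, start, length):
        """Return a window to the cache, merging adjacent extents."""
        bisect.insort(self.windows, (start, length))
        merged = []
        for s, l in self.windows:
            if merged and merged[-1][0] + merged[-1][1] == s:
                prev_s, prev_l = merged[-1]
                merged[-1] = (prev_s, prev_l + l)
            else:
                merged.append((s, l))
        self.windows = merged

cache = WindowCache([(0, 16), (100, 40)])
print(cache.alloc(32))   # 100 -- found without touching the bitmap
print(cache.alloc(32))   # None -- rejected without a scan
```

The point is that a too-small request is rejected in O(windows) rather than O(bits), and the common case never touches the bitmap buffer at all.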

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEtEhVLPWxlyuTD7IRAnt8AJ4qnp+578/oqKbyLbXJJoFewfOuSwCcDJJN
izEeprRI0kSOmTZ860sVYOY=
=xUpP
-----END PGP SIGNATURE-----