
Hans Reiser wrote:
> Jeff, are you sure that you need this code to exist?  Here are the
> problems I see:
> 
>     for the average case, it is suboptimal.  The seeks to the bitmaps
> are far more expensive than the averaged cost of keeping them in ram.
> 
>     for 16TB filesystems, they will have plenty of budget for ram
> 
>     it complicates code if it has to worry about such things as not
> enough clean memory for holding bitmaps, etc.
> 
>     It is more appropriate to write this kind of code for the
> development branch which is V4.  This kind of code is likely to have
> hard to test and hit bugs.
> 
>     The mount time problem should be solved by querying the device
> geometry, and inserting into the queue requests for every disk drive in
> parallel.  The current code fails to keep all the spindles busy.  It
> would be nice if there was general purpose code for querying about how a
> device divides into spindles so that scheduling in general can be optimized.
> 
>     This should be a nondefault mount option.
> 
> That said, thanks for paying attention to a problem Namesys discussed
> but lacked the manpower for addressing.  Do you think you could discuss
> your plans before coding next time?  I agree that ReiserFS V3 and V4
> mount time is too long.  15 minutes is clearly not acceptable.  Perhaps
> there is a deeper IO scheduler problem beyond bitmaps that should be
> addressed though.....
> 

Hans -

There are two issues here: The amount of time required to read in the
bitmap blocks at mount time, and the resources that are wasted due to
maintaining unused bitmap data in memory. Your arguments are reasonable,
but the user response to each of them is the same: They will simply
choose another filesystem to deploy rather than deal with the caveats of
ReiserFS.

I agree that there may be opportunities to optimize the I/O scheduler,
but even if we ignored the blockdev<->filesystem layering violations
and had perfect knowledge of the storage subsystem, there would still
be latency associated with reading the data in. There may be any number of
abstractions between the block device presented to the filesystem and
the actual spindles (md, dm, loop, or hardware raid) and the block dev
subsystem is best equipped to handle that. The goal is not to make mount
times quicker than they are now, but to make them negligible. Suppose
for the sake of argument that somehow the I/O scheduler could be
leveraged to reduce the mount time by 90%. This is an incredibly
optimistic number and still it only reduces the 15 minute mount time to
90 seconds. That's 90 seconds on *every* boot during which the system is
unavailable. Those 90 seconds add up, and will be the difference between
a site deploying reiserfs and choosing another solution that doesn't
have that caveat.

That said, the resource savings are a largely secondary benefit, but
they may be quite important for many users, including those deploying
embedded devices. We are not in a position to dictate hardware
purchasing requirements to our users. It's not reasonable to expect more than the
disk space required to store the filesystem itself. "Huge" filesystems
that were once reserved for large servers can now be found on the
desktop. For a few hundred dollars in hardware, I can construct a
multi-terabyte array under my desk. A typical usage for something like
this would be to store music, movies, or say an A/V editing suite. On a
system with 512MB of RAM, the 32 MB allocation for ONLY bitmaps is a
huge resource hit. On embedded systems that are tight on RAM, where
developers use alternate C libraries just to shave off a few KiB of
memory use, pinning bitmaps is a total waste of resources. Telling the
user "go buy
more memory" is not an acceptable solution. Again, this will only mean
another user chooses a different solution than reiserfs.
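To put rough numbers on it: each bitmap bit tracks one filesystem block,
so with 4 KiB blocks the pinned-bitmap memory grows linearly at about
32 MiB per TiB of filesystem. A quick back-of-envelope sketch (the
helper function is mine for illustration, not reiserfs code):

```python
# Back-of-envelope: memory pinned by keeping every block-allocation
# bitmap in RAM. One bitmap bit per filesystem block, so a filesystem of
# fs_bytes with block_size-byte blocks needs fs_bytes / (block_size * 8)
# bytes of bitmap. Illustrative only -- not actual reiserfs code.

def bitmap_bytes(fs_bytes, block_size=4096):
    """Bytes of bitmap needed to track every block of the filesystem."""
    return fs_bytes // (block_size * 8)

if __name__ == "__main__":
    TiB = 2**40
    MiB = 2**20
    # A 1 TiB filesystem with 4 KiB blocks pins 32 MiB of bitmaps:
    print(bitmap_bytes(1 * TiB) // MiB, "MiB")   # 32 MiB
    # ...and a 16 TiB filesystem pins 512 MiB:
    print(bitmap_bytes(16 * TiB) // MiB, "MiB")  # 512 MiB
```

On a 512 MB desktop with a couple of TiB of storage, that's a double-digit
percentage of RAM gone before a single file is opened.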

ReiserFS v3 has an established track record as a stable filesystem. V4
may be an excellent successor, but many users simply aren't interested.
They want particular features now and aren't willing to be guinea pigs
for V4 in order to get them. We've seen this time and again with feature
additions. Denying user demands with the mantra of "wait for it in V4"
has left many users frustrated, and they will once again choose
something else rather than deal without features they can have on other
filesystems.

The performance difference, I suspect, will be negligible. If the
bitmaps are really in heavy use (which is only the case for a limited
set of workloads) then the buffer cache will keep those around anyway.
If the memory is needed elsewhere, the system has the "big picture" view
and should be able to make those decisions. Swapping out user code or
data vs. keeping ReiserFS bitmaps in memory has a performance impact
either way, and I suspect the former is the worse of the two. Regarding
the unavailability of memory for bitmaps, we must
already sleep in order to get the buffer heads for parts of the tree
that aren't pinned in memory. This case isn't any different. We also
already sleep waiting for bitmap blocks to become unlocked.

As for whether it should be the default, I've only posted this code for
testing purposes so far, but eventually I think it should be. We've seen
what happens when useful features get buried under a mount-time option
(-oattrs, anyone?) - they get ignored. I think that once this code has
seen active testing, -opin_bitmaps should become the option and reading
bitmaps on demand should become the default.

-Jeff

--
Jeff Mahoney
SuSE Labs
