Hans Reiser wrote:
> Jeff, are you sure that you need this code to exist?  Here are the
> problems I see:
>
> for the average case, it is suboptimal.  The seeks to the bitmaps
> are far more expensive than the averaged cost of keeping them in ram.
>
> for 16TB filesystems, they will have plenty of budget for ram
>
> it complicates code if it has to worry about such things as not
> enough clean memory for holding bitmaps, etc.
>
> It is more appropriate to write this kind of code for the
> development branch which is V4.  This kind of code is likely to have
> hard to test and hit bugs.
>
> The mount time problem should be solved by querying the device
> geometry, and inserting into the queue requests for every disk drive in
> parallel.  The current code fails to keep all the spindles busy.  It
> would be nice if there was general purpose code for querying about how a
> device divides into spindles so that scheduling in general can be
> optimized.
>
> This should be a nondefault mount option.
>
> That said, thanks for paying attention to a problem Namesys discussed
> but lacked the manpower for addressing.  Do you think you could discuss
> your plans before coding next time?  I agree that ReiserFS V3 and V4
> mount time is too long.  15 minutes is clearly not acceptable.  Perhaps
> there is a deeper IO scheduler problem beyond bitmaps that should be
> addressed though.....
Hans -

There are two issues here: the amount of time required to read in the
bitmap blocks at mount time, and the resources that are wasted by
maintaining unused bitmap data in memory. Your arguments are reasonable,
but the user response to each of them is the same: users will simply
choose another filesystem to deploy rather than deal with the caveats of
ReiserFS.

I agree that there may be opportunities to optimize the I/O scheduler,
but even if we ignored the blockdev<->filesystem layering violations and
had perfect knowledge of the storage subsystem, there is still latency
associated with reading the data in. There may be any number of
abstractions between the block device presented to the filesystem and
the actual spindles (md, dm, loop, or hardware RAID), and the block dev
subsystem is best equipped to handle that.

The goal is not to make mount times quicker than they are now, but to
make them negligible. Suppose for the sake of argument that the I/O
scheduler could somehow be leveraged to reduce the mount time by 90%.
That is an incredibly optimistic number, and it still only reduces the
15 minute mount time to 90 seconds. That's 90 seconds of unavailability
on *every* boot. It adds up, and it will be the difference between a
site deploying reiserfs and choosing another solution that doesn't have
that caveat.

That said, the resource savings are largely a secondary benefit, but one
that may be quite important for many users, including those deploying
embedded devices. We are not in a position to dictate hardware
purchasing guidelines to our users. It's not reasonable to expect more
than the disk space required to store the filesystem itself. "Huge"
filesystems that were once reserved for large servers can now be found
on the desktop. For a few hundred dollars in hardware, I can build a
multi-terabyte array under my desk. A typical use for something like
this would be to store music, movies, or an A/V editing suite.
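To ground those mount-time numbers, here is some back-of-the-envelope
arithmetic. This is only a sketch: it assumes a 4 KiB blocksize, the v3
layout of one bitmap block per (8 * blocksize) blocks, and a guessed
~8 ms average seek-plus-read per scattered bitmap block; none of these
are measurements.

```python
BLOCK_SIZE = 4096                    # assumed 4 KiB filesystem blocksize
BLOCKS_PER_BITMAP = BLOCK_SIZE * 8   # one bit per block: each bitmap block covers 128 MiB
AVG_SEEK_S = 0.008                   # assumed ~8 ms to seek and read one scattered block

def mount_read_time_s(fs_bytes):
    """Estimate the time to read every bitmap block serially at mount."""
    total_blocks = fs_bytes // BLOCK_SIZE
    # Ceiling division: count the bitmap blocks needed to cover the fs.
    bitmap_blocks = (total_blocks + BLOCKS_PER_BITMAP - 1) // BLOCKS_PER_BITMAP
    return bitmap_blocks * AVG_SEEK_S

TiB = 1 << 40
print(round(mount_read_time_s(1 * TiB)))    # about a minute for a 1 TiB fs
print(round(mount_read_time_s(16 * TiB)))   # roughly a quarter hour for 16 TiB
```

Under these assumptions a 16 TiB filesystem has 131072 bitmap blocks,
which at 8 ms apiece lands in the same ballpark as the 15 minute mount
times discussed above.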
On a system with 512 MB of RAM, a 32 MB allocation for bitmaps alone is
a huge resource hit. On embedded systems that are tight on RAM, where
alternate C libraries are used just to shave a few KiB of memory use,
pinning bitmaps is a total waste of resources. Telling the user to "go
buy more memory" is not an acceptable solution; again, it only means
another user chooses a solution other than reiserfs.

ReiserFS v3 has an established track record as a stable filesystem. V4
may be an excellent successor, but many users simply aren't interested.
They want particular features now and aren't willing to be guinea pigs
for V4 in order to get them. We've seen this time and again with feature
additions. Denying user demands with the mantra of "wait for it in V4"
has left many users frustrated, and they will once again choose
something else rather than do without features they can have on other
filesystems.

The performance difference, I suspect, will be negligible. If the
bitmaps are really in heavy use (which is only the case for a limited
set of workloads), the buffer cache will keep them around anyway. If the
memory is needed elsewhere, the system has the "big picture" view and
should be able to make that decision. Swapping out user code or data
versus keeping ReiserFS bitmaps in memory has a performance impact
either way, and I suspect the former is the worse case.

Regarding the unavailability of memory for bitmaps: we must already
sleep in order to get the buffer heads for parts of the tree that aren't
pinned in memory, so this case isn't any different. We also already
sleep waiting for bitmap blocks to become unlocked.

As for whether it should be the default, I've only posted this code for
testing purposes. Eventually, I think it should be the default. We've
seen what happens when useful features get buried under a mount-time
option (-oattrs, anyone?) - they get ignored.
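The 32 MB figure above falls out of the same on-disk layout. As a
sketch, again assuming a 4 KiB blocksize and one bitmap block per
(8 * blocksize) blocks, the memory cost of pinning every bitmap block
scales linearly with filesystem size:

```python
BLOCK_SIZE = 4096                    # assumed 4 KiB blocksize
BLOCKS_PER_BITMAP = BLOCK_SIZE * 8   # each bitmap block tracks 32768 blocks (128 MiB)

def pinned_bitmap_bytes(fs_bytes):
    """Memory consumed by pinning every bitmap block of a filesystem."""
    total_blocks = fs_bytes // BLOCK_SIZE
    bitmap_blocks = (total_blocks + BLOCKS_PER_BITMAP - 1) // BLOCKS_PER_BITMAP
    return bitmap_blocks * BLOCK_SIZE

TiB = 1 << 40
MiB = 1 << 20
print(pinned_bitmap_bytes(1 * TiB) // MiB)    # 32 MiB pinned for a 1 TiB fs
print(pinned_bitmap_bytes(16 * TiB) // MiB)   # 512 MiB pinned for a 16 TiB fs
```

So the 32 MB hit corresponds to a ~1 TiB filesystem; at the 16 TB sizes
Hans mentions, the pinned bitmaps alone exceed the 512 MB of RAM in the
desktop example above.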
I think that once this code has seen active testing, -opin_bitmaps
should become an option and reading bitmaps on demand should become the
default.

-Jeff

--
Jeff Mahoney
SuSE Labs
