On Mar 1, 2007, at 10:00 AM, Sam Lang wrote:
On Mar 1, 2007, at 9:52 AM, Phil Carns wrote:
Sam Lang wrote:
On Feb 28, 2007, at 6:54 AM, Phil Carns wrote:
I know that you guys still have some ongoing discussion about the long
range design for tracking handles, but I have another item about the
current implementation that might be of interest.
Most of the remaining startup performance problem (after Sam's
optimization patches) appears to be a result of how the db is
ordered.
If I modify the attr db's comparison function so that it has a "<"
rather than ">", then all of the preads during startup go in order
through the db rather than backwards. This takes the startup time on a
cold db down to just 34 seconds. Previously it was 2 minutes 22 seconds.
It still could be faster, but that seems to be the biggest part of the
time. I imagine the rest of it is just the access size (4 KB at
a time) that might be tunable through some Berkeley DB settings.
The downside of making that particular change to the comparison
method is that it breaks storage space compatibility.
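To illustrate the ordering point, here is a minimal sketch of how flipping a comparison reverses the key order that a forward scan sees. It uses qsort() as a stand-in for the btree's ordering and does not touch Berkeley DB itself. The names demo_handle_t, demo_cmp_ascending, demo_cmp_descending and demo_sort_handles are hypothetical, and the real attr db comparison function may look different.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a handle key; the real key layout may differ. */
typedef uint64_t demo_handle_t;

/* Ascending order: smaller handles sort first, so a forward scan
 * visits the keys in increasing order. */
static int demo_cmp_ascending(const void *a, const void *b)
{
    demo_handle_t x = *(const demo_handle_t *)a;
    demo_handle_t y = *(const demo_handle_t *)b;
    return (x > y) - (x < y);
}

/* Descending order: the same comparison with the operands swapped,
 * so a forward scan visits the keys from largest to smallest. */
static int demo_cmp_descending(const void *a, const void *b)
{
    return demo_cmp_ascending(b, a);
}

/* Sort an array of handles with the chosen comparator (illustration only). */
static void demo_sort_handles(demo_handle_t *h, size_t n, int ascending)
{
    qsort(h, n, sizeof *h,
          ascending ? demo_cmp_ascending : demo_cmp_descending);
}
```

The data is identical in both cases; only the order in which a forward walk encounters it changes, which is why the on-disk layout is tied to the comparison function.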
I wonder if it might be possible to accomplish the same thing in the
current db format by modifying iterate_handles() to just run the cursor
backwards (using DB_PREV instead of DB_NEXT)? That wouldn't hurt
storage space compatibility (if it works), but I don't know if it
makes any difference to callers of that function what order the
handles come out in.
It doesn't matter to the caller. You'll also need to set the
cursor to the last position in the db with DB_LAST. Does
DB_PREV work with DB_MULTIPLE though? It's not clear from the
above: does the improvement to 34 seconds occur with MULTIPLE or
without?
I mentioned previously that the dspace db gets opened with the
RECNUM flag. I don't think that's necessary, and removing it
will invariably improve performance, but we need a way to return
the position for iterate_handles. The easiest thing to do is
turn PVFS_ds_position into a uint64_t (currently it's only
uint32_t). That breaks interfaces and protocols though.
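A rough sketch of what "use the handle as the position" could look like, assuming a 64-bit position and keys kept in ascending order. A sorted array and a binary search stand in for the btree and a range lookup; this is not the dbpf code. The names demo_position_t and demo_iterate_handles are hypothetical, and the sketch assumes that 0 is never a valid handle so it can mean "start from the beginning".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit position: the last handle returned so far,
 * or 0 to start from the beginning (assumes 0 is not a valid handle). */
typedef uint64_t demo_position_t;

/* Copy up to max handles that are greater than *pos from a sorted
 * array into out, advance *pos to the last handle copied, and return
 * the number of handles copied. */
static size_t demo_iterate_handles(const uint64_t *sorted, size_t n,
                                   demo_position_t *pos,
                                   uint64_t *out, size_t max)
{
    /* Binary search for the first handle strictly greater than *pos. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] <= *pos)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Hand back the next batch and remember where we stopped. */
    size_t count = 0;
    while (lo < n && count < max)
        out[count++] = sorted[lo++];
    if (count > 0)
        *pos = out[count - 1];
    return count;
}
```

Because the position is itself a key, resuming only needs a key lookup and no record numbers, which is the reason the RECNUM flag would no longer be needed.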
I don't know if the PREV approach would work with MULTIPLE or
not. The 34 second times (with inverted comparison function) were
run with your MULTIPLE patches applied. I didn't try it without
the patches.
I couldn't find anything in the Berkeley DB documentation about
DB_MULTIPLE_KEY and DB_PREV not being allowed, but when I tried it, it
returned an error about illegal flag combinations. So our options are to either use
DB_PREV without DB_MULTIPLE (no storage format changes), or change
the comparison function and storage format so that we can use
DB_NEXT with DB_MULTIPLE_KEY.
Checking the storage format version and providing the appropriate
comparison function wouldn't be hard though, and wouldn't require
any "migration" of the old to new format. Older formats wouldn't
benefit from the performance improvements though.
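For what it's worth, the version check could be as small as the sketch below. Everything named here is hypothetical: DEMO_FORMAT_VERSION_ORDERED, demo_compare_legacy, demo_compare_ordered and demo_select_compare are made up for illustration, and the assumption that the existing function sorts in the reverse direction is only inferred from the discussion above.

```c
#include <stdint.h>

/* Signature of a handle comparison: negative, zero, or positive. */
typedef int (*demo_compare_fn)(uint64_t a, uint64_t b);

/* Assumed ordering of existing storage spaces (reverse order). */
static int demo_compare_legacy(uint64_t a, uint64_t b)
{
    return (a < b) - (a > b);
}

/* Ordering for new storage spaces (ascending order). */
static int demo_compare_ordered(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

/* Hypothetical first storage format version that uses the new ordering. */
#define DEMO_FORMAT_VERSION_ORDERED 2u

/* Pick the comparison function that matches the format on disk, so an
 * old storage space keeps working without any migration. */
static demo_compare_fn demo_select_compare(uint32_t storage_version)
{
    return storage_version >= DEMO_FORMAT_VERSION_ORDERED
               ? demo_compare_ordered
               : demo_compare_legacy;
}
```

In Berkeley DB the selected function would presumably be wrapped to take DBT keys and registered with DB->set_bt_compare before the database is opened, so the version has to be known before the open call.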
Can we conclude this discussion? In summary:
* The current comparison function causes bad IO patterns for iterate
on the dspace db. We can change it but the disk format will change
in new releases.
  - If we change it, either we check a version number and provide the
    right comparison function, or we perform migration to the new storage
    format.
  - If we don't change it, we can still improve performance by
    iterating from the last entry to the first, but we can't use
    DB_MULTIPLE_KEY, which also improves performance for big filesystems.
* If we change PVFS_ds_position from uint32_t to uint64_t, we can use
the handle as the position, and avoid opening the dspace db with the
RECNUM flag, which is killing our performance on writes.
-sam
-sam
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers