On Feb 23, 2007, at 2:38 PM, Rob Ross wrote:
Wow, I didn't think that this stuff would come up again so soon :).
The current implementation (a) tracks what is free, (b) tracks what
was recently used, (c) lets the server choose the handle to return,
and (d) keeps a global handle space (a handle is unique across all
servers).
the point of (a) was to avoid having to hit storage to find a free
handle. i agree that this is perhaps not that big a deal now that
we have a better handle on how to efficiently use berkeley db.
(a) and (c) together mean that clients never have to retry to get a
handle. i agree that in itself isn't all that valuable.
(b) ensures that clients caching metadata on some file don't end up
accessing some newly created file's data, or deleting some new
file's object, or some similar thing. this is an important part of
allowing clients to cache file metadata (specifically datafile
handles) without coordination.
(c) also allows us to precreate objects if we wanted to, although
we don't do that right now. this would be less important if/when
server-to-server communication is in place and we move file
creation over to the server side.
(d) eventually allows us to move objects around without updating
the file's metadata, assuming that we come up with a different
mechanism for determining where a file resides. A bloom filter sort
of approach might work, as an example. Or if server-to-server were
working the servers could just figure out where things are with
some aggregate comm.
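The bloom-filter aside can be made concrete: each server would publish a small filter summarizing the handles it holds, and a client probes the filters to guess where an object lives (a false positive just costs an extra server probe; a false negative can never happen). A minimal sketch, with illustrative sizes and hashing -- none of this is actual PVFS code:

```python
import hashlib

class BloomFilter:
    """Per-server summary of held handles; supports membership guesses
    with no false negatives and a tunable false-positive rate."""

    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0  # integer bitmask standing in for a real bit array

    def _positions(self, handle):
        # derive nhashes independent bit positions from the handle
        for i in range(self.nhashes):
            d = hashlib.sha256(f"{i}:{handle}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.nbits

    def add(self, handle):
        for p in self._positions(handle):
            self.bits |= 1 << p

    def may_contain(self, handle):
        # True means "possibly here"; False means "definitely not here"
        return all(self.bits >> p & 1 for p in self._positions(handle))
```

A client would test each server's filter in turn and only contact the servers whose filters answer True.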
walt's idea seems to allow us to map a collection of objects (a
"segment") to a given server, then a client could pick values in
that segment. my feeling is that this hamstrings our ability to
move objects around, because we would then need to move entire
segments around; at the very least it could take a very long
time to reach a consistent state again (think of many large objects
needing to be moved; how would clients know whom to contact?). this
idea is a generalization of pete's idea to have a server id be part
of the object handle; pete's approach makes it impossible to
migrate without changing file metadata. more on this below.
pete's idea of speeding up creates by guessing at free handles is
ok, but the right way to speed up creates is to precreate. then the
latency can be hidden in the mix of other operations. lustre
already does this, and i believe it is very effective for them.
Can I try to clarify what precreate means for PVFS? There are
different pieces that might benefit separately from pre-creation.
1. create metadata handle. requires:
a. client message to md server
b. new dspace db entry
c. new keyval db entry
2. create datafile handles. requires:
a. client message to each IO server
b. new db entry in IO server
3. setattr of datafile handles array to metadata
a. client message to md server
b. modify keyval db entry
4. crdirent to 2nd metadata server (potentially the same)
a. client message to md server 2
b. create keyval db entry
5. create bstream file (this happens on first write)
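Counting the client messages in the steps above (with step 5 deferred to first write), one create today costs something like:

```python
def create_messages(num_io_servers):
    """Client->server messages for one file create in the current path.
    Steps refer to the list above; step 5 (bstream create) happens on
    first write, so it adds nothing here."""
    return (1                 # 1a: metadata handle create, to md server
            + num_io_servers  # 2a: one datafile-handle create per IO server
            + 1               # 3a: setattr of the datafile handle array
            + 1)              # 4a: crdirent to the second md server
```

So a file striped over 4 IO servers costs 7 round trips before any data is written.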
With PVFS right now (no server-to-server, etc) it should be easy to
get rid of 2a, and move 2b to the first write (with the bstream
create). All we need to do is partition the datafile handles of each
IO server to the metadata servers in the fs.conf (a sort of pre-
create). This allows the md server to pick datafile handles and do
1,2, and 3 all in one message.
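The partitioning could be as simple as splitting each IO server's datafile-handle range evenly across the metadata servers; a sketch (the ranges and server counts are made up, and the real assignment would come from fs.conf):

```python
def partition_handles(io_range, num_md_servers):
    """Split one IO server's datafile-handle range [lo, hi) evenly among
    the metadata servers, so each md server can hand out datafile handles
    for that IO server without a network round trip."""
    lo, hi = io_range
    step = (hi - lo) // num_md_servers
    parts = []
    for i in range(num_md_servers):
        a = lo + i * step
        # the last md server absorbs any remainder of the range
        b = hi if i == num_md_servers - 1 else a + step
        parts.append((a, b))
    return parts
```

The sub-ranges are disjoint and cover the whole range, so no two md servers can ever hand out the same datafile handle.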
If it's still necessary to maintain ledgers for the datafile handles,
they can be kept on the md servers (one for each IO server), and at
initialization, we can pull the handles out from the keyval entries
to populate the ledger.
Usually when I hear pre-create I think of trying to get rid of all
those client->IOserver messages. But there's still the cost of
adding the db entries and bstream files. My guess would be that this
cost is negligible compared to the latency of sending all those
messages, but maybe on fast networks it's not. We could certainly
populate the db with a bunch of entries and keep a ledger of unused
handles, but that requires making the ledger persistent. I don't
think keeping placeholder db entries that have to be filled in for
setattr and crdirent will be much faster than creating new ones.
There's also still the cost of the crdirent, which is a keyval db
entry create. That can't really be precreated, and it's got to be
synchronous with the other creation bits.
As a first pass, partitioning the datafile handles for each IO server
up amongst the metadata servers seems like an obvious improvement,
and then when server-to-server is in place, making that partitioning
process more dynamic would be an easy add-on. My gut feeling is that
placeholder entries in the db would take significant effort to code
and wouldn't be worth the benefit.
he is correct that randomly picking values would lead to nasty
data structures in the ledger.
I don't see the point of keeping the ledger around (in its range
style data structure form) if we're picking values randomly.
i'd be happy to see the "free" list part of the ledger disappear if
that helps. i do think that the "recently freed" list has to stay
for the reason listed above, although it could be implemented
differently perhaps -- maybe just leave an entry in the DB noting
when the object was freed, and if it is referenced again after the
appropriate time we consider it up for grabs? this has a nice side-
effect of keeping the object "off limits" even if a server is
restarted.
That sounds nice!
-sam
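The freed-timestamp scheme described above might look roughly like this; here a dict stands in for the DB entry that would make the timestamp survive a restart, and the quarantine length is an arbitrary placeholder:

```python
import time

class HandleAllocator:
    """Handles carry a freed-at timestamp; a freed handle is off limits
    until the quarantine passes, so clients with stale cached metadata
    can never land on a newly created object under an old handle."""

    def __init__(self, lo, hi, quarantine_secs=3600):
        self.free = set(range(lo, hi))
        self.freed_at = {}  # handle -> time freed (would live in the DB)
        self.quarantine = quarantine_secs

    def alloc(self, now=None):
        now = time.time() if now is None else now
        # reclaim quarantined handles whose waiting period has expired
        for h, t in list(self.freed_at.items()):
            if now - t >= self.quarantine:
                del self.freed_at[h]
                self.free.add(h)
        return self.free.pop()  # raises KeyError if nothing is available

    def release(self, h, now=None):
        self.freed_at[h] = time.time() if now is None else now
```

Because the freed-at record is keyed by handle, persisting it alongside the object's DB entry gives the restart-safety property for free.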
i don't understand why it is difficult to get a value in a
particular range in the OSD work. can you clarify this pete? can't
you just "guess" a value in the range until you get one?
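The "guess until you get one" loop is tiny; in a sparsely used range almost every first guess succeeds (`in_use` here is a stand-in for whatever collision check the target can actually do):

```python
import random

def guess_handle_in_range(lo, hi, in_use, max_tries=64):
    """Pick a random value in [lo, hi) until one is free. With a sparse
    range the expected number of tries is barely above one."""
    for _ in range(max_tries):
        h = random.randrange(lo, hi)
        if h not in in_use:
            return h
    raise RuntimeError("range unexpectedly dense; fall back to a scan")
```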
one thing that we could discuss is the relative merit of migration
using this sort of approach. maybe in fact this idea that i have
that we want to keep a FS-wide object handle space is flawed, that
changing file metadata can be addressed in a reasonable way that
simplifies the overall system, allows for migration, and doesn't
have a negative impact on our caching of metadata.
overall i think that changing how we reference objects, with the
exception of perhaps redoing how we keep up with free/recently-
freed objects, is something that should perhaps wait until we have
server-to-server working. we're likely to want to make some changes
at that point anyway, once the system has more control over the
construction of files and directories. maybe we can discuss how
we'd like things to work in that context and concentrate on getting
there, rather than torquing things now and then perhaps messing
with things again?
thanks everyone! it's fun to get to sit and think about this stuff,
especially after many days of travel and meetings :).
regards,
rob
Walter B. Ligon III wrote:
I don't understand this. Is there a scheme whereby there is no
mapping of the handle ID to a server? If not, then what we are
talking about, I think, is whether the server mapping is fixed or
not. The idea behind the current scheme was to make the mapping
of servers to handles flexible. That said, the specific
implementation could perhaps be better. For example, using 128 bits
we could have a 64 bit segment tag and a 64 bit handle ID. The
segment tag would map the handle to a server via the tables, and
the ID would be unique within that segment. This might simplify
some things without losing the flexibility we have.
As it is, the server can still randomly pick an ID, or a client
could randomly pick one; they just have to do it within a range,
which isn't particularly hard. With this suggested modification
we could "eliminate" the range by giving all "handle ranges" a
built-in extent of 64 bits, which I think is the same as what you
were suggesting.
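The 128-bit layout suggested above is just bit packing; a sketch (names are illustrative):

```python
SEGMENT_BITS = 64

def pack_handle(segment, obj_id):
    """128-bit handle: a 64-bit segment tag (mapped to a server via the
    tables) above a 64-bit ID unique within that segment."""
    assert 0 <= segment < 1 << 64 and 0 <= obj_id < 1 << 64
    return segment << SEGMENT_BITS | obj_id

def unpack_handle(handle):
    """Recover (segment, obj_id) from a packed 128-bit handle."""
    return handle >> SEGMENT_BITS, handle & (1 << SEGMENT_BITS) - 1
```

Since the segment tag alone determines the owning server, an ID can be chosen freely within the 64-bit extent, which is the "built-in extent" idea above.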
If I'm not being clear, let me know and I'll try again. Or, if I
don't understand the problem, let me know that.
Walt
Pete Wyckoff wrote:
For create scalability, you may want the client to pick handle IDs
and offer those to the server, so that you can optimistically create
a metafile assuming there are no collisions on the server. These
guessed handle IDs can be random though. We did not implement this
as it would be quite expensive if implemented in terms of the
existing extent/extentlist/ledger data structures.
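The optimistic scheme described above might be sketched as follows; `server_handles` is a stand-in for the server's handle store, and in a sparse 64-bit space the first offer almost always succeeds:

```python
import random

def optimistic_create(server_handles, tries=8):
    """Client picks a random 64-bit handle ID and offers it; the server
    accepts only if it is unused, so no extent/ledger lookup is needed
    on the common (collision-free) path."""
    for _ in range(tries):
        h = random.getrandbits(64)
        if h not in server_handles:  # server-side collision check
            server_handles.add(h)
            return h
    raise RuntimeError("repeated collisions; client should back off")
```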
In the OSD work, we have to do painful things to return a handle ID
in a particular range. I would much rather have the server pick a
random ID and give it to the client. Or for the client to try to
pick a particular ID and hope there is no collision at the server.
So I'd like to discard the idea of pre-assigned per-server handle
ranges and augment our notion of PVFS_handle to include some sort of
"server identifier" as well as the 64-bit ID that is private to the
particular device on which the object sits.
Various distributed FS implementations for wide-area use seem to be
happy with 128-bit handles and assume collisions will never happen.
This always struck me as wildly reckless, but maybe it is time to
accept the fact that these number spaces are really big.
-- Pete
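Whether 128-bit random handles are "wildly reckless" is answerable with the birthday bound: for n handles drawn uniformly from a 2^b space, the probability of any collision is about 1 - exp(-n^2 / 2^(b+1)). A quick check:

```python
import math

def collision_probability(n_objects, bits=128):
    """Birthday-bound approximation of the chance that any two of
    n_objects randomly drawn b-bit handles collide."""
    return 1.0 - math.exp(-n_objects ** 2 / 2.0 ** (bits + 1))
```

Even a trillion objects give a collision probability around 10^-15 with 128 bits, while 32 bits would collide almost surely at a million objects, so the "really big number space" argument holds up.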
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers