The FID subsystem was designed to do exactly that: FIDs in the version of
Lustre we posted are unique cluster-wide and have a location database.
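
As a rough illustration of the idea (loosely modelled on the FID code;
treat the field names and layout as a sketch, not the exact on-disk
format):

    #include <stdint.h>

    /*
     * Cluster-wide file identifier: clients and OSTs reference files
     * only by FID, never by raw ext3 inode number and generation, so
     * an object's identity survives a move between MDTs.
     */
    struct lu_fid {
            uint64_t f_seq; /* sequence: a FID range granted to one MDT */
            uint32_t f_oid; /* object id within that sequence */
            uint32_t f_ver; /* version, reserved */
    };

    /* The location database maps a sequence to the index of the MDT
     * that owns it; this prototype is a stand-in for that lookup. */
    uint32_t fid_location_lookup(uint64_t f_seq);

Routing a FID anywhere in the cluster is then just a lookup of its
sequence owner, independent of where the backing inode lives.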

- Peter -

On 4/25/07, Shobhit Dayal <[EMAIL PROTECTED]> wrote:

Thanks,
I think I see the direction we need to head in. I guess there are no
short-term hacks to get multiple MDSs up in our environment. What we
were really interested in was demonstrating the dynamic redistribution
concept with as little change to Lustre as possible, but I guess what
you are saying is that it may not be possible to do that.

It's also helpful to know the FID approach you took for managing inode
changes.

Thanks for taking so much time out and writing such detailed mails.
When I manage to bring up something, I'll let you know :)

-Shobhit

On 4/25/07, Andreas Dilger <[EMAIL PROTECTED]> wrote:
>
> On Apr 23, 2007  23:38 -0400, Shobhit Dayal wrote:
> > To give you some context of what we are doing: we're trying to build
> > a clustered MDS service in Lustre, based on a paper from CMU on
> > dynamic redistribution:
> > http://www.pdl.cs.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-06-105_abs.html
> >
> > We aren't really looking to replicate the MDS servers in this design;
> > that was just a hack to get us started on getting two MDSs up that
> > shared a namespace, by copying the ext3 of one MDS to another.
>
> If you aren't looking at replication, then you are in fact implementing
> exactly what the CMD project at CFS has been working to complete.
>
> > So for instance, if there are multiple MDSs, each serving a part of a
> > global namespace, and the client issues a rename that moves a file
> > from mds2 to mds1, the following approach can be used in the context
> > of Lustre (sketched in code below):
> >
> > Mount the ext3 filesystem of mds2 from mds1.
> > Delete the original file in the ext3 of mds2.
> > Create a new file at the appropriate path in the ext3 of mds1.
> > Unmount the ext3 of mds2 from mds1.
> >
> > All the above operations can be transactioned locally on mds1 for
> > atomicity.
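> >
> > In C-flavoured pseudocode, the proposal is roughly this (all helper
> > names are hypothetical, not real Lustre or ext3 entry points):
> >
> >     /* Hypothetical helpers, named for illustration only. */
> >     int mount_remote_ext3(const char *mds, const char *mntpt);
> >     int unlink_on(const char *root, const char *path);
> >     int create_on(const char *root, const char *path);
> >     void begin_local_transaction(void);
> >     void end_local_transaction(int rc);
> >     void umount_path(const char *mntpt);
> >
> >     /* Cross-MDS rename as a sequence of steps local to mds1. */
> >     int cross_mds_rename(const char *src, const char *dst)
> >     {
> >             int rc = mount_remote_ext3("mds2", "/mnt/mds2");
> >
> >             if (rc)
> >                     return rc;
> >             begin_local_transaction();        /* all local to mds1 */
> >             rc = unlink_on("/mnt/mds2", src); /* drop from mds2's MDT */
> >             if (rc == 0)
> >                     rc = create_on("/", dst); /* recreate on mds1 */
> >             end_local_transaction(rc);        /* commit or abort */
> >             umount_path("/mnt/mds2");
> >             return rc;
> >     }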
>
> Since ext3 is itself not a shared filesystem and can only be mounted on
> a single MDS at one time, it would be FAR easier and faster to just
> have MDS1 do a synchronous operation to MDS2 instead of trying to
> coordinate unmounting and remounting the filesystem across nodes.
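>
> Sketched against the same kind of hypothetical helpers as above, the
> synchronous alternative is just one RPC plus a local create:
>
>     /* Hypothetical: MDS1 asks MDS2 to do its half over the network. */
>     int mds_rpc_unlink(const char *mds, const char *path);
>     int local_create(const char *path);
>
>     int sync_cross_mds_rename(const char *src, const char *dst)
>     {
>             /* Returns only once mds2 has committed the unlink. */
>             int rc = mds_rpc_unlink("mds2", src);
>
>             return rc ? rc : local_create(dst); /* then create on mds1 */
>     }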
>
> > We'll have to deal with the problem that deleting the file from mds2
> > and recreating it on mds1 will change its inode number and generation
> > count, since these values are directly used at the OST as an object
> > reference. And so we are implementing something that will allow us to
> > remember the old inode numbers and generation count on mds1.
>
> CMD implemented a new abstraction layer of file identifiers ("FIDs")
> that keeps the ext3 inode numbers internal to the filesystem and
> exposes only abstracted numbers for the inodes to the clients.
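>
> Roughly (names illustrative; reusing the lu_fid shape sketched earlier
> in this thread):
>
>     /* The MDT keeps a private index from FID to local inode, so the
>      * ino/generation pair never leaves the server. */
>     struct oi_entry {
>             struct lu_fid oi_fid; /* what clients and OSTs reference */
>             uint64_t      oi_ino; /* local ext3 inode number */
>             uint32_t      oi_gen; /* local ext3 generation */
>     };
>
> Re-creating a file on another MDT then only writes a new index entry;
> the FID, and every external reference to it, stays the same.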
>
> > But we're stuck on the problem of even bringing up two MDSs in the
> > Lustre environment and getting an OST with one LOV to share that LOV
> > between both the MDSs. Lustre doesn't allow us to configure MDSs/OSTs
> > in this way; OSTs don't listen to two MDSs at the same time.
>
> The LOV is really for client->many-OST communication, and you would
> need the equivalent LMV layer for client->many-MDT communication.
> Each inode would get an lmv striping EA that tells the client which
> MDT the inode resides on, just like the lov EA tells which OST the
> object lives on.
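>
> As a rough illustration of such an EA (layout hypothetical):
>
>     /* Stored on the inode; the client reads it to find the MDT(s)
>      * holding the inode, as the lov EA does for OST objects. */
>     struct lmv_stripe_ea {
>             uint32_t lmv_magic;        /* identifies the EA format */
>             uint32_t lmv_stripe_count; /* number of MDTs involved */
>             uint32_t lmv_mdt_index[];  /* MDT index per stripe */
>     };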
>
> > Is there an easy way to bring up two MDSs such that an OST with a
> > single LOV will allow two MDSs to connect to it, and pass around
> > object references to objects that lie in this single volume?
>
> You need to add a Logical Metadata Volume (LMV) layer to have the
> single llite->MDC connection be multiplexed to multiple MDTs.
>
> > On 4/23/07, Andreas Dilger <[EMAIL PROTECTED]> wrote:
> > >
> > >On Apr 20, 2007  19:00 -0400, Shobhit Dayal wrote:
> > >> We're a group of students at CMU and we're building a project
> > >> around Lustre. A main part of the work involves introducing
> > >> multiple MDS servers in Lustre.
> > >
> > >I'm sad to inform you that the work for introducing multiple MDTs
> > >for a single filesystem has been going on for several years already,
> > >and is mostly done (target for release some time at the end of this
> > >year). This is what we call "clustered metadata" (CMD). I'm not sure
> > >what our policy would be for releasing an alpha version of this code.
> > >
> > >> Now we have a design for managing metadata from multiple MDSs, but
> > >> we were wondering how much work it is, besides changing MDS
> > >> metadata management, to introduce a new active MDS server. Our
> > >> impression so far is that neither the client nor the OSTs will work
> > >> easily with a new active MDS entity in the cluster, in terms of
> > >> managing connections from multiple MDSs, and that they will have to
> > >> be changed. Is this correct?
> > >
> > >For CMD, there is a new "logical metadata volume" (LMV) that handles
> > >the connections from the filesystem to the multiple MDTs. This is
> > >somewhat analogous to the LOV, in that it spreads MDT access and
> > >operations over the multiple MDTs. Each MDT is still mostly
> > >independent, in that each exports a single ext3 filesystem (like
> > >multiple OSTs on a single OSS), rather than sharing access to the
> > >same block device.
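> > >
> > >As a toy illustration of the analogy (hypothetical, not CMD code):
> > >
> > >    /* LMV-style placement: hash a directory entry's name to pick
> > >     * which of the n independent MDTs services it. */
> > >    static unsigned int lmv_pick_mdt(const char *name, unsigned int n)
> > >    {
> > >            unsigned int h = 5381;          /* djb2 string hash */
> > >
> > >            while (*name)
> > >                    h = h * 33 + (unsigned char)*name++;
> > >            return h % n;                   /* MDT index, 0..n-1 */
> > >    }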
> > >
> > >> For instance, for experimental purposes: we created a
> > >> client-->mds-->ost chain and created some files, 'foo' and 'bar',
> > >> through it. Then we replicated the filesystem of the MDS that
> > >> stores all the metadata onto another MDS, mds2. Now we introduced
> > >> a second client and tried to set up the connections
> > >> client2-->mds2-->ost.
> > >
> > >Ah, this is somewhat different from CMD, where each MDT is a (mostly)
> > >independent subset of the filesystem. The CMD code has no replication
> > >between MDTs. That would definitely be an interesting and worthwhile
> > >project. It would be implemented in a very similar manner, with a
> > >replicating layer between llite and the MDC, each MDC connecting to
> > >a separate MDT.
> > >
> > >> This setup does not work when foo and bar are written from both
> > >> clients: changes cannot be seen from both clients. As soon as the
> > >> second MDS connects, client1 and mds1 seem to lose their connection
> > >> with the OST.
> > >>
> > >> Can someone point us to the right way to bring up two MDSs in the
> > >> Lustre environment, even though it may lead to data/metadata
> > >> corruption?
> > >
> > >You need a layer, like the LOV is for OSCs, to handle multiple
> > >independent connections. Then, that layer should replicate the
> > >requests to each of the MDTs for modifying events (in MDT order),
> > >and could, e.g., round-robin read-only events (e.g. getattr) to
> > >help spread the load.
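> > >
> > >A sketch of that dispatch policy (all types and names hypothetical):
> > >
> > >    struct md_request { int is_modifying; /* ... */ };
> > >    struct mdc_conn;                     /* one connection per MDT */
> > >    int mdc_send(struct mdc_conn *c, struct md_request *req);
> > >
> > >    static struct mdc_conn *mdt[8];
> > >    static int num_mdts, next_rr;
> > >
> > >    int replicated_md_request(struct md_request *req)
> > >    {
> > >            int i, rc = 0;
> > >
> > >            if (req->is_modifying) {
> > >                    /* Replicate to every MDT, always in MDT order,
> > >                     * so the replicas cannot diverge or deadlock. */
> > >                    for (i = 0; i < num_mdts && rc == 0; i++)
> > >                            rc = mdc_send(mdt[i], req);
> > >            } else {
> > >                    /* Read-only (e.g. getattr): any replica will
> > >                     * do, so round-robin to spread the load. */
> > >                    rc = mdc_send(mdt[next_rr++ % num_mdts], req);
> > >            }
> > >            return rc;
> > >    }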
> > >
> > >Cheers, Andreas
> > >--
> > >Andreas Dilger
> > >Principal Software Engineer
> > >Cluster File Systems, Inc.
> > >
> > >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
>
