Hi Andreas,
Thanks for your reply; it is really helpful.

To give you some context for what we are doing, we're trying to build a
clustered MDS service in Lustre based on a paper from CMU on dynamic
redistribution.
http://www.pdl.cs.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-06-105_abs.html

We aren't really looking to replicate the MDS servers in this design. That
was just a hack to get us started on bringing up two MDSs that share a
namespace, by copying the ext3 filesystem of one MDS to another.

Dynamic redistribution proposes an easier way to decentralise the MDS
service than implementing distributed transactions for cross-server
operations such as rename. It proposes that a single server perform
rename-like operations by temporarily becoming the owner of both objects
until the operation completes.

So for instance, if there are multiple MDSs, each serving a part of a
global namespace, and the client issues a rename that moves a file from
mds2 to mds1, the following approach can be used in the context of Lustre:

Mount the ext3 filesystem of mds2 from mds1.
Delete the original file in the ext3 filesystem of mds2.
Create a new file at the appropriate path in the ext3 filesystem of mds1.
Unmount the ext3 filesystem of mds2 from mds1.
All of the above operations can be wrapped in a local transaction on mds1
for atomicity.

Other operations on mds1 and mds2 on the relevant directory paths will be
disabled until the rename completes.
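
To make the sequence above concrete, here is a rough Python sketch of the
idea, not Lustre code. MdsStore, borrowed and cross_mds_rename are
hypothetical names; each ext3 namespace is modelled as an in-memory dict,
and a per-store lock stands in for disabling other operations on the
affected paths.

```python
import threading
from contextlib import contextmanager


class MdsStore:
    """Hypothetical in-memory stand-in for one MDS's backing ext3 namespace."""

    def __init__(self, name):
        self.name = name
        self.files = {}               # path -> metadata dict
        self.lock = threading.Lock()  # models blocking other operations


@contextmanager
def borrowed(owner, other):
    """Model 'mount the other MDS's ext3 on the owner' for one operation."""
    if owner is other:
        raise ValueError("a cross-MDS rename needs two distinct stores")
    # Take both locks in a fixed order (by name) so that two concurrent
    # renames in opposite directions cannot deadlock.
    first, second = sorted((owner, other), key=lambda s: s.name)
    with first.lock, second.lock:
        yield other  # the owner now has exclusive access to both stores


def cross_mds_rename(dst_mds, src_mds, src_path, dst_path):
    """Move src_path on src_mds to dst_path on dst_mds, done by dst_mds alone."""
    with borrowed(dst_mds, src_mds) as mounted:
        if src_path not in mounted.files:
            raise FileNotFoundError(src_path)
        if dst_path in dst_mds.files:
            raise FileExistsError(dst_path)
        meta = mounted.files.pop(src_path)        # delete the original on mds2
        try:
            dst_mds.files[dst_path] = dict(meta)  # create the new entry on mds1
        except Exception:
            mounted.files[src_path] = meta        # roll back: all or nothing
            raise
```

In the real system the rollback would come from the local ext3 transaction
on mds1 rather than from a try/except, and the lock would cover only the
relevant directory paths rather than the whole store.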

We'll have to deal with the problem that deleting the file from mds2 and
recreating it on mds1 will change its inode number and generation count,
and these values are used directly at the OST as an object reference. So
we are implementing something that will allow us to remember the old
inode number and generation count on mds1.
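
As a rough illustration of what "remembering" could look like (a sketch
under our own assumptions, not our actual implementation): a table on mds1
maps the new inode number and generation of the recreated file back to the
pair the OST objects were created with. InodeRef and InodeAliasTable are
hypothetical names.

```python
from typing import Dict, NamedTuple


class InodeRef(NamedTuple):
    ino: int  # inode number
    gen: int  # generation count


class InodeAliasTable:
    """Hypothetical table: recreated inode -> reference the OSTs still hold."""

    def __init__(self) -> None:
        self._old_by_new: Dict[InodeRef, InodeRef] = {}

    def remember(self, new: InodeRef, old: InodeRef) -> None:
        # Record that the file now known as `new` was created as `old`.
        self._old_by_new[new] = old

    def ost_reference(self, current: InodeRef) -> InodeRef:
        # Files that were never moved keep using their own (ino, gen) pair.
        return self._old_by_new.get(current, current)
```

Usage would be along the lines of table.remember(InodeRef(88, 1),
InodeRef(1201, 7)) at rename time, then table.ost_reference(...) whenever
mds1 builds an object reference for the OST. The table would of course
have to be persistent on mds1; the dict here is only for illustration.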

But we're stuck on the problem of even bringing up two MDSs in the Lustre
environment and getting an OST with one LOV to share that LOV between both
MDSs. Lustre doesn't allow us to configure MDSs and OSTs in this way:
OSTs don't listen to two MDSs at the same time.

Is there an easy way to bring up two MDSs such that an OST with a single
LOV will allow both MDSs to connect to it and pass around references to
objects that lie in this single volume?

Thanks
Shobhit


On 4/23/07, Andreas Dilger <[EMAIL PROTECTED]> wrote:

On Apr 20, 2007  19:00 -0400, Shobhit Dayal wrote:
> We're a group of students at CMU and we're building a project around
> lustre. A main part of the work involves introducing multiple mds
> servers in lustre.

I'm sad to inform you that the work for introducing multiple MDTs for
a single filesystem has been going on for several years already, and
is mostly done (target for release some time at the end of this year).
This is what we call "clustered metadata" (CMD).  I'm not sure what our
policy for releasing an alpha version of this code would be.

> Now we have a design for managing metadata from multiple mds's, but we
> were wondering how much work it is, besides changing mds metadata
> management, to introduce a new active mds server. Our impression so far
> is that neither the client nor the ost's will work easily with a new
> active mds entity in the cluster in terms of managing connections from
> multiple mds's and that they will have to be changed. Is this correct ?

For CMD, there is a new "logical metadata volume" (LMV) that handles the
connections from the filesystem to the multiple MDTs.  This is somewhat
analogous to the LOV, in that it spreads MDT access and operations over
the multiple MDTs.  Each MDT is still mostly independent in that they
export a single ext3 filesystem (like multiple OSTs on a single OSS),
rather than any shared-access to the same block device.

> For instance, for experiment purpose: we created a client-->mds-->ost
> and created some file through them 'foo', 'bar'. Then replicated the
> file system on the mds that stores all the metadata onto another mds
> mds2.
> Now we introduced a second client and tried to setup the connections
> client2-->mds2-->ost

Ah, this is somewhat different than CMD where each MDT is a (mostly)
independent subset of the filesystem.  The CMD code has no replication
between MDTs.  That would definitely be an interesting and worthwhile
project.  It would be implemented in a very similar manner, with a
replicating layer between llite and the MDC, each MDC connecting to a
separate MDT.

> This setup does not work when foo, bar are written from both clients.
> Changes cannot be seen from both clients. As soon as the second mds
> connects, the client1, mds1 seem to lose their connection with the ost.
>
> Can someone point us to the right way to bring up two mds's in the
> lustre environment, even though it may lead to data/metadata corruption ?

You need a layer like LOV is for OSCs to handle multiple independent
connections.  Then, that layer should handle replicating the requests to
each of the MDTs for modifying events (in MDT order), and could e.g.
round-robin for read-only events (e.g. getattr) to help spread the load.
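
A minimal Python sketch of the kind of layer described above, purely
illustrative and not taken from the CMD or LOV code. FakeMdt,
ReplicatingLayer and the MODIFYING set are hypothetical names, and the
split between modifying and read-only operations is an assumption.

```python
from itertools import cycle

# Assumed classification; anything not listed is treated as read-only.
MODIFYING = {"create", "unlink", "rename", "setattr"}


class FakeMdt:
    """Hypothetical stand-in for one connection to an MDT."""

    def __init__(self, index):
        self.index = index
        self.log = []  # requests this MDT has received, in arrival order

    def handle(self, op, *args):
        self.log.append((op, args))
        return (self.index, op)


class ReplicatingLayer:
    """Fan modifying requests out to every MDT; round-robin the reads."""

    def __init__(self, mdts):
        # Fixed MDT order, so every replica sees modifications in one sequence.
        self.mdts = sorted(mdts, key=lambda m: m.index)
        self._next_reader = cycle(self.mdts)

    def request(self, op, *args):
        if op in MODIFYING:
            # Replicate to each MDT in MDT order and collect the replies.
            return [m.handle(op, *args) for m in self.mdts]
        # Read-only request: send it to the next MDT in turn.
        return next(self._next_reader).handle(op, *args)
```

With two MDTs, a "create" lands in both logs, while successive "getattr"
requests alternate between MDT 0 and MDT 1. Error handling for a replica
that fails part-way through a modifying request is left out of the sketch.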

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
