Hi Greg,

In the case where an inode is created on mds.a and exported to mds.b, there is 
a potential race on mds.b between a subsequent lookup-by-ino and the primary 
link actually making it into the inode container.

Our tentative solution was to rely on the way InoTable breaks up the range of 
inode numbers based on mds nodeid. So when a lookup on the inode container 
fails, we can determine which mds would have allocated that inode number and 
attempt to find the inode there. The originating mds.a should always find the 
inode in its cache while it's pinned for export. If mds.a finds the inode, the 
lookup-by-ino on mds.b waits for the import to finish; otherwise it returns 
failure.
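
To make the fallback concrete, here is a rough sketch in Python (not MDS 
code). Everything in it is an assumption for illustration: the 40-bit 
per-rank range layout, the names allocating_mds and lookup_by_ino, and the 
use of plain sets in place of the inode container and the remote cache query.

```python
from enum import Enum

# Assumed layout: rank r owns inode numbers [(r+1) << 40, (r+2) << 40).
# This stands in for however InoTable actually splits the range.
INO_BITS = 40


class Result(Enum):
    FOUND = "found"
    WAIT_FOR_IMPORT = "wait_for_import"
    NOT_FOUND = "not_found"


def allocating_mds(ino: int) -> int:
    """Rank that would have allocated `ino` under the assumed layout."""
    return (ino >> INO_BITS) - 1


def lookup_by_ino(ino: int, container: set, caches: dict) -> Result:
    """Decide how mds.b answers a lookup-by-ino.

    container: inode numbers whose primary link mds.b can already see.
    caches:    rank -> inode numbers pinned in that MDS's cache
               (hypothetical stand-in for asking the remote MDS).
    """
    if ino in container:
        return Result.FOUND
    # Container miss: ask the MDS that would have allocated this number.
    origin = allocating_mds(ino)
    if ino in caches.get(origin, set()):
        # Still pinned for export on the originating MDS, so the
        # primary link is in flight; wait for the import to finish.
        return Result.WAIT_FOR_IMPORT
    return Result.NOT_FOUND
```

For example, an inode allocated by rank 0 that is still pinned there would 
give lookup_by_ino(ino, set(), {0: {ino}}) == Result.WAIT_FOR_IMPORT.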

Casey

----- Original Message -----
From: "Gregory Farnum" <[email protected]>
To: "Casey Bodley" <[email protected]>
Cc: "Matt W. Benjamin" <[email protected]>, [email protected], 
"aemerson" <[email protected]>, "peter honeyman" 
<[email protected]>, "Sage Weil" <[email protected]>
Sent: Wednesday, October 17, 2012 4:18:04 PM
Subject: Re: parent xattrs on file objects

On Wed, Oct 17, 2012 at 12:40 PM, Casey Bodley <[email protected]> wrote:
> To expand on what Matt said, we're also trying to address this issue of 
> lookups by inode number for use with NFS.
>
> The design we've been exploring is to create a single system inode, 
> designated the 'inode container' directory, which stores the primary links to 
> all inodes in the filesystem. These links are named by their inode number to 
> satisfy lookups and obviate the need for an anchor table. This design allows 
> the inode container to make use of existing directory fragmentation and load 
> balancing to distribute the inodes over the MDS cluster.
>
> When a new file is created, it then adds two links: a primary link into the 
> inode container, and a remote link into the filesystem namespace. In the case 
> where the parent directory fragment's authority is different than the 
> corresponding inode container fragment's, it is created in the parent 
> directory then exported to the inode container via an asynchronous slave 
> request.
>
> We welcome additional discussion, both on this design specifically and on the 
> general topic of scalable ino lookups.

So if the primary link isn't always in the "inode container", you must
be preserving the anchor table for this setup. Am I understanding that
correctly? Or is there some other mechanism for linking them that's
less expensive?
-Greg