Hi Sage,

In the interest of timeliness, I'll post a few thoughts now.

----- "Sage Weil" <[email protected]> wrote:

> On Wed, 17 Oct 2012, Adam C. Emerson wrote:

> 
> I think this basic approach is viable.  However, I'm hesitant to give
> up on embedded inodes because of the huge performance wins in the
> common cases; I'd rather have a more expensive lookup-by-ino in the
> rare cases where nfs filehandles are out of cache and be smokin' fast
> the rest of the time.

Broadly, for us, lookup-by-ino -is- a fast path.  Being fast for name lookups 
but slow for inode lookups seems out of balance.

> 
> Are there reasons you're attached to your current approach?  Do you
> see problems with a generalized "find this ino" function based on the
> file objects?  I like the latter because it
> 
>  - means we can scrap the anchor table, which needs additional work
>    anyway if it is going to scale
>  - is generally useful for fsck
>  - solves the NFS fh issue
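
For concreteness, my reading of the object-based scheme is roughly the
sketch below.  The xattr name ("parent"), the object-naming helper, and
the flat-text encoding are all my guesses for illustration, not existing
interfaces:

import rados

def data_object(ino, block=0):
    # Ceph names a file's data objects "<ino hex>.<block hex>".
    return "%x.%08x" % (ino, block)

def store_backtrace(ioctx, ino, ancestry):
    # ancestry: [(parent_ino, name), ...], leaf first.  Re-written via
    # setxattr whenever the file's path changes.
    blob = "\n".join("%x %s" % (p, n) for p, n in ancestry)
    ioctx.set_xattr(data_object(ino), "parent", blob.encode())

def find_ino(ioctx, ino):
    # Common case: a single object read recovers the full ancestry.
    blob = ioctx.get_xattr(data_object(ino), "parent").decode()
    return [(int(p, 16), n)
            for p, n in (l.split(" ", 1) for l in blob.splitlines())]

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")  # the fs data pool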

The proposed approach, as I understand it, is costly: it optimizes for some 
workloads at the definite expense of others.  (The side benefits, e.g. to 
fsck, might completely justify that cost.  But "we need it anyway" is only a 
decisive argument if we've already accepted the premise that inode lookups 
can be slow.)

By contrast, the additional cost our approach adds is small and constant, 
though we grant that it falls on a fast path.  As motivation, it solves both 
the lookup-by-ino and hard-link problems much more satisfactorily, as far as 
I can see.
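
To make that trade concrete, the shape of what we have in mind is
roughly as follows; the object naming and the dentries-as-xattrs layout
are illustrative only, not our actual format:

def inode_object(ino):
    return "inode.%x" % ino

def lookup_by_ino(ioctx, ino):
    # Fast path for NFS filehandles and hard links: one read, by key.
    return ioctx.read(inode_object(ino), length=4096, offset=0)

def dentry_lookup(ioctx, dir_ino, name):
    # Dentries modeled as xattrs on the directory's object, purely to
    # keep the sketch self-contained.
    raw = ioctx.get_xattr(inode_object(dir_ino), "dentry." + name)
    return int(raw.decode(), 16)

def lookup_by_name(ioctx, dir_ino, name):
    # The extra constant-cost hop: the dentry yields the ino, then we
    # fetch the inode object exactly as lookup_by_ino does.
    return lookup_by_ino(ioctx, dentry_lookup(ioctx, dir_ino, name))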

Obviously, we -hope- we are not sacrificing "smokin' fast" name lookups for 
(smokin') fast inode lookups.  As in UFS, we can use caching, bulkstat [which 
proved to be a huge win in AFS and DFS], and, given Ceph's design, 
parallelism to close the gap in what -we hope- would be the actual common 
case.  Of course we might be wrong; we haven't implemented all of that yet.  
To be convincing we would probably need to do real performance measurement 
and comparison, and we presume we will.
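
As one example of what we mean, given the illustrative layout above,
bulkstat plus parallelism might look like this; the fan-out across OSDs
is what we expect to close the gap:

from concurrent.futures import ThreadPoolExecutor

def bulk_stat(ioctx, inos, width=16):
    # Issue the inode reads concurrently; RADOS spreads the objects
    # across OSDs, so wall-clock latency approaches one round trip
    # rather than growing with len(inos).
    def fetch(ino):
        return ino, ioctx.read("inode.%x" % ino, length=4096, offset=0)
    with ThreadPoolExecutor(max_workers=width) as pool:
        return dict(pool.map(fetch, inos))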

> 
> The only real downsides I see to this approach are:
> 
>  - more OSD ops (setxattrs... if we're smart, they'll be cheap)
>  - lookup-by-ino for resolving hard links may be slower than the
>    anchor table, which gives you *all* ancestors in one lookup, vs.
>    this, which may range from 1 lookup to (depth of tree) lookups (or
>    possibly more, in rare cases).  For all the reasons that the anchor
>    table was acceptable for hard links, though (hard link rarity,
>    parallel link patterns), I can live with it.
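
To spell out where the 1-to-(depth) range comes from: if the stored
ancestry has gone stale, say after a rename of an ancestor, each level
has to be re-verified against its parent directory, one lookup per
level.  A toy model, not real interfaces:

def resolve(ino, read_backtrace, dir_has):
    # read_backtrace(ino) -> [(parent_ino, name), ...], leaf first:
    #   the single-lookup common case.
    # dir_has(dir_ino, name, ino) -> bool: one more lookup per
    #   ancestor when verification is needed.
    path = []
    for parent, name in read_backtrace(ino):
        if not dir_has(parent, name, ino):
            # Stale entry: a rename invalidated the ancestry, and
            # recovery may cost still more lookups (the rare case).
            raise LookupError("stale backtrace at %x" % parent)
        path.append(name)
        ino = parent
    return "/".join(reversed(path))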
> 
> There are also lots of people who seem to be putting BackupPC (or
> whatever it is) on Ceph, which is creating huge messes of hard links,
> so it will be really good to solve/avoid the current anchor table
> scaling problems.
> 
> sage

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309