On Mon, Nov 25, 2013 at 05:14:51PM -0800, Matthew Ahrens wrote:
> Overall this seems like a fine idea.  Here are a few things we'll need
> to think about:
> 
> Allocating the dnodes becomes trickier now; we will need to think about how
> slow it could be.  Depending on the use case, it could make sense to have
> "slabs" of dnodes of different sizes.  E.g. each 16k block could
> contain dnodes of all the same size.  Although then you have the problem of
> finding a slab of the right dnode-size...
> 
> I'd like to see a use case before we actually integrate this.  Presumably the
> extra space is used for the bonus buffer and thus the SA.

Yes, exactly. Currently, in the linux port, one is able to use the SA to
store xattrs. Thus, depending on the size of the xattr stored, getxattr can
result in an extra IO to fetch the spill block. My use case is to speed
this up by eliminating the extra spill block read.

> 
> I think the use case will inform the user interface.  Ideally the specific
> use case could handle setting the dnode size transparently to the admin,
> without the need for a "dnodesize" property.

The "dnodesize" property was mainly just a convenience feature I added for
testing. The interface I am currently thinking of using is to modify the
prototype of dmu_object_alloc to take a size parameter (i.e. the prototype
of dmu_object_alloc_impl, which I added).

My main motivation is because Lustre uses xattrs to store information
about a file, and it is just large enough to use a spill block. Since it
ties into the DMU, it could allocate objects with large enough dnodes
such that its xattrs fit in the SA.

With that said, I could see an interface to change the default dnode
size being useful for specific workloads. I'm not sure a dataset
property is the proper way to expose that information though.

> 
> The dnode_phys_t is 512 bytes, which happens to be the same as the prevalent
> disk sector size (512).  But this is a coincidence, and newer disks have 4K
> sectors.  So I'd suggest you not name things in terms of sectors, but
> rather as "count" (e.g. this dnode uses the space of 3 minimum-size
> dnodes) or just "bytes" (which we assert is always a multiple of
> DNODE_SIZE).

Sure. I was just trying to be consistent with the "dn_datablkszsec" field
in that structure. The only drawback to using "bytes" is that it would
consume more space in the dnode_phys_t than a "count" of 512-byte chunks.
Personally, "bytes" is easier to reason about, but I'm not sure it's worth
another 8 bits (16 total) to store the size.

-- 
Cheers, Prakash

> 
> --matt
> 
> 
> On Mon, Nov 25, 2013 at 11:09 AM, Prakash Surya <[email protected]> wrote:
>      Hi,
> 
>      I have a compelling use case for larger dnodes and have spent a
>      small amount of time looking into a potential way of implementing
>      this. But, before I spend any more time on it, I want to get some
>      buy in from the community and upstream on the implementation
>      details.
> 
>      My current plan for adding support for larger dnodes is actually
>      pretty simple. I want to "steal" 8 bits from the unused padding
>      currently in the dnode_phys_t structure, use these bits to store a
>      size for the dnode (in number of 512 byte sectors), and then
>      consult this size whenever traversing a dnode block (in order to
>      properly skip over the necessary amount of bonus space and reach
>      the "next" dnode in the block). This would not only allow us to
>      have a larger fixed size for dnodes, but it would also allow us to
>      have variable sized dnodes.
> 
>      I have a *VERY* experimental patch to update the object allocation
>      code to test this idea out, and it seems to work pretty well. I can
>      create a dataset, create dnodes with varying size (.5K - 16K), and
>      then use zdb to ensure the object number allocation worked as
>      planned. If you are interested in checking out the source, it's up
>      on github:
> 
>          https://github.com/prakashsurya/zfs/commit/15c110722a50bd7196a71b9a360e5217445c7d13
> 
>      Does anybody see any major road blocks to implementing larger
>      dnodes in this way? I'm still pretty unfamiliar with this section
>      of the code, so I'd love for some input from the people more
>      familiar with it.
> 
>      There is still a lot of work to go from this experimental patch to
>      something ready to land. The send/receive bits, diff bits, and code
>      to actually use the extra bonus area all remain to be implemented;
>      I was just interested in testing out the object allocation code
>      with that patch.
> 
>      To properly land upstream, this type of change is dependent on
>      adding dataset level feature flags as well. So I'll soon need to
>      bring up that discussion, and decide on what's the best way to
>      implement that.
> 
>      I welcome any and all feedback on this idea. So far, through
>      testing and off list discussions, I don't foresee any major issues.
>      It's just a matter of careful implementation and extensive testing.
> 
>      --
>      Cheers, Prakash
> 
>      _______________________________________________
>      developer mailing list
>      [email protected]
>      http://lists.open-zfs.org/mailman/listinfo/developer
> 