On Mon, Nov 25, 2013 at 6:05 PM, Prakash Surya <[email protected]> wrote:

> On Mon, Nov 25, 2013 at 05:14:51PM -0800, Matthew Ahrens wrote:
> > Overall this seems like a fine idea.  Here are a few things we'll need to
> > think about:
> >
> > Allocating the dnodes becomes trickier now; we will need to think about
> > how slow it could be.  Depending on the use case, it could make sense to
> > have "slabs" of dnodes of different sizes.  E.g. each 16k block could
> > contain dnodes of all the same size.  Although then you have the problem
> > of finding a slab of the right dnode-size...
> >
> > I'd like to see a use case before we actually integrate this.  Presumably
> > the extra space is used for the bonus buffer and thus the SA.
>
> Yes, exactly. Currently, in the Linux port, one can use the SA to store
> xattrs. Thus, depending on the size of the stored xattr, getxattr can
> result in an extra I/O to fetch the spill block. My use case is to speed
> this up by eliminating the extra spill block read.
>
> >
> > I think the use case will inform the user interface.  Ideally the
> > specific use case could handle setting the dnode size transparently to
> > the admin, without the need for a "dnodesize" property.
>
> The "dnodesize" property was mainly just a convenience feature I added for
> testing. The interface I am currently thinking of using is to modify the
> prototype of dmu_object_alloc to take a size parameter (i.e. the prototype
> of dmu_object_alloc_impl which I added).
>
> My main motivation is that Lustre uses xattrs to store information about a
> file, and they are just large enough to require a spill block. Since
> Lustre ties into the DMU directly, it could allocate objects with dnodes
> large enough that its xattrs fit in the SA.
>
> With that said, I could see an interface to change the default dnode
> size being useful for specific workloads. I'm not sure a dataset
> property is the proper way to expose that information though.
>
> >
> > The dnode_phys_t is 512 bytes, which happens to be the same as the
> > prevalent disk sector size (512).  But this is a coincidence, and newer
> > disks have 4K sectors.  So I'd suggest you not name things in terms of
> > sectors, but rather as "count" (e.g. this dnode uses the space of 3
> > minimum-size dnodes) or just "bytes" (which we assert is always a
> > multiple of DNODE_SIZE).
>
> Sure. I was just trying to be consistent with the "dn_datablkszsec"
> field in that structure. The only drawback to using "bytes" is that it
> would consume more space in the dnode_phys_t than a "count" of 512-byte
> chunks. Personally, "bytes" is easier to reason about, but I'm not sure
> it's worth another 8 bits (16 total) to store the size.
>

For the on-disk format, you probably want the field to be "number of
subsequent dnode_phys_t's consumed by this one".  That way the current
value of 0 is consistent with the current on-disk format, so the "upgrade"
can simply enable the feature.

I'm working on a feature that also does something like per-dataset feature
flags, so I'd be happy to work with you on creating an infrastructure to do
this for any number of features in a generic way.  Basically, the refcount
of the pool-wide feature flag is the number of datasets that could be using
the feature.  Each dataset keeps track of whether it could be using the
feature (with an entry in the "extensible dataset" ZAP), and thus whether
it is counted in the feature flag refcount.  There are a few places that
need to deal with copying the flag and bumping the refcount (e.g. snapshot,
clone).  We could do that in a more generic way so that code won't need to
be added to these places for every feature that handles its refcount this
way.

--matt


>
> --
> Cheers, Prakash
>
> >
> > --matt
> >
> >
> > On Mon, Nov 25, 2013 at 11:09 AM, Prakash Surya <[email protected]> wrote:
> >      Hi,
> >
> >      I have a compelling use case for larger dnodes and have spent a
> >      small amount of time looking into a potential way of implementing
> >      this. But, before I spend any more time on it, I want to get some
> >      buy-in from the community and upstream on the implementation
> >      details.
> >
> >      My current plan for adding support for larger dnodes is actually
> >      pretty simple. I want to "steal" 8 bits from the unused padding
> >      currently in the dnode_phys_t structure, use these bits to store a
> >      size for the dnode (in number of 512 byte sectors), and then
> >      consult this size whenever traversing a dnode block (in order to
> >      properly skip over the necessary amount of bonus space and reach
> >      the "next" dnode in the block). This would not only allow us to
> >      have a larger fixed size for dnodes, but it would also allow us to
> >      have variable-sized dnodes.
> >
> >      I have a *VERY* experimental patch to update the object allocation
> >      code to test this idea out, and it seems to work pretty well. I
> >      can create a dataset, create dnodes with varying size (.5K - 16K),
> >      and then use zdb to ensure the object number allocation worked as
> >      planned. If you are interested in checking out the source, it's up
> >      on github:
> >
> >          https://github.com/prakashsurya/zfs/commit/15c110722a50bd7196a71b9a360e5217445c7d13
> >
> >      Does anybody see any major road blocks to implementing larger
> >      dnodes in this way? I'm still pretty unfamiliar with this section
> >      of the code, so I'd love some input from the people more familiar
> >      with it.
> >
> >      There is still a lot of work to go from this experimental patch to
> >      something ready to land. The send/receive bits, diff bits, and
> >      code to actually use the extra bonus area all remain to be
> >      implemented; I was just interested in testing out the object
> >      allocation code with that patch.
> >
> >      To properly land upstream, this type of change is dependent on
> >      adding dataset-level feature flags as well. So I'll soon need to
> >      bring up that discussion, and decide on the best way to implement
> >      that.
> >
> >      I welcome any and all feedback on this idea. So far, through
> >      testing and off-list discussions, I don't foresee any major
> >      issues. It's just a matter of careful implementation and extensive
> >      testing.
> >
> >      --
> >      Cheers, Prakash
> >
> >      _______________________________________________
> >      developer mailing list
> >      [email protected]
> >      http://lists.open-zfs.org/mailman/listinfo/developer
> >
>