That all sounds about right to me.  In illumos we have changed this to a
tunable, so that you may override the "inflation factor" if, for example,
you know that you are not using RAID-Z.  Ideally the code would be changed
to do this automatically (take into account whether you are using RAID-Z,
ditto blocks, and dedup).

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/spa_misc.c#275
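
For illustration, a tunable version of the estimate could look roughly
like the sketch below. This is not copied from the tree: the names
asize_inflation and get_asize_estimate are hypothetical stand-ins, and the
default of 24 is simply the worst case worked out in the quoted message.
See the link above for the real code.

```c
#include <stdint.h>

/*
 * Hypothetical tunable (name and default are assumptions for this sketch).
 * 24 = (3 + 1) * 3 * 2: max-parity RAID-Z, 3 DVAs per bp, and the extra
 * factor of 2 for dedup ditto blocks.
 */
uint64_t asize_inflation = 24;

/* Worst-case allocated size for a write of lsize logical bytes. */
uint64_t
get_asize_estimate(uint64_t lsize)
{
    return (lsize * asize_inflation);
}
```

With a tunable like this, someone who knows the pool has no RAID-Z vdevs
could lower the factor (to 6, say) without rebuilding anything.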

--matt



On Sat, Nov 30, 2013 at 10:10 AM, Kohsuke Kawaguchi <[email protected]> wrote:

> In my quest to understand why my TXG size is so small, I found the
> following code in ZFS on Linux. In illumos, the code is formatted
> differently but it's basically the same:
>
> uint64_t
> spa_get_asize(spa_t *spa, uint64_t lsize)
> {
>     /*
>      * The worst case is single-sector max-parity RAID-Z blocks, in which
>      * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
>      * times the size; so just assume that.  Add to this the fact that
>      * we can have up to 3 DVAs per bp, and one more factor of 2 because
>      * the block may be dittoed with up to 3 DVAs by ddt_sync().
>      */
>     return (lsize * (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2);
> }
>
> All in all this results in 4 * 3 * 2 = 24 times inflation of the value.
>
> If I have a 2GB quota on a file system, ZFS estimates my 100MB write to
> need 2.4GB of space, which prevents it all from going into a single TXG.
> Instead, it decides to allow about 80MB (2GB/24) to go into the first TXG,
> and the rest has to wait until that first TXG is committed and ZFS realizes
> that 1.92GB (2GB-80MB) worth of space is still available.
>
> So now I'm trying to see if there's any way to improve this estimate. But
> I'm new to the ZFS codebase, and so I'm looking for some help.
>
> - I'm not using raidz, so I should be able to knock off x4 right off the
> bat. Shouldn't there be a way to tell if the pool is using raidz by looking
> at spa->spa_root_vdev and calculating multiplication factors based on the
> vdev tree? (And since the vdev tree shape won't change that much, hopefully
> pre-calculate this value and store it in vdev_t)
>
> - My 100MB write is write system calls to a file, and my understanding is
> that for those ZFS wouldn't replicate blocks. So I'm wasting x3, too. What
> if we pass in more contextual parameters to determine proper block
> replication factor? If so, where can I learn the block replication policy
> in ZFS?
>
> - On the last x2 factor, it appears that "ddt_sync" is dedup related? If
> so, again is there any way to tell that dedup is not on and skip this? I
> suppose this is not a property of spa but of dsl_dataset (?). Is something
> like that feasible?
>
> Any insights/thoughts into this would be highly appreciated.
>
> --
> Kohsuke Kawaguchi
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
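
To make the three factors concrete, here is a rough sketch of what a
configuration-aware estimate might compute. Everything in it is
hypothetical: struct pool_cfg, inflation_factor and max_lsize are invented
for this example and do not exist in ZFS, and real code would have to
derive these values from the vdev tree and dataset properties. It also
ignores the fact that metadata may be written with more copies than data,
so a real implementation would likely need to be more conservative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the pool configuration; not a real ZFS type. */
struct pool_cfg {
    unsigned raidz_parity;  /* 0 if no RAID-Z vdevs, else max parity (1..3) */
    unsigned max_copies;    /* DVAs that may be written per bp (1..3) */
    bool dedup;             /* true if writes may go through the DDT */
};

/* Product of the three worst-case factors from the quoted comment. */
uint64_t
inflation_factor(const struct pool_cfg *cfg)
{
    uint64_t factor = (uint64_t)cfg->raidz_parity + 1;

    factor *= cfg->max_copies;
    if (cfg->dedup)
        factor *= 2;
    return (factor);
}

/* Largest logical write admitted when avail bytes are believed free. */
uint64_t
max_lsize(uint64_t avail, const struct pool_cfg *cfg)
{
    return (avail / inflation_factor(cfg));
}
```

With parity 3, 3 copies and dedup on, the factor is the familiar 24. For
a pool with no RAID-Z, single-copy data and dedup off it drops to 1, so
the whole 100MB write from the example would fit in one TXG.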