That all sounds about right to me. In illumos we have changed this to a tunable, so that you may override the "inflation factor" if, for example, you know that you are not using RAID-Z. Ideally the code would be changed to do this automatically (take into account whether you are using RAID-Z, ditto blocks, and dedup).
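
To make the two ideas above concrete, here is a minimal sketch, not the actual illumos code. It assumes the tunable is a plain global (named spa_asize_inflation here, which should be checked against the linked source) and drops the spa_t argument so the fragment compiles on its own. The asize_factor() helper and its three parameters are purely hypothetical; they only illustrate what an automatic calculation might take into account.

```c
#include <stdint.h>

/* Constants as implied by the quoted "4 * 3 * 2 = 24" arithmetic. */
#define VDEV_RAIDZ_MAXPARITY 3
#define SPA_DVAS_PER_BP      3

/*
 * Assumed tunable: worst-case multiplier applied to a logical size.
 * The default preserves the old (3 + 1) * 3 * 2 = 24 behaviour.
 */
int spa_asize_inflation = 24;

/* Simplified stand-in for spa_get_asize(); the real one takes a spa_t *. */
uint64_t
asize_tunable(uint64_t lsize)
{
	return (lsize * (uint64_t)spa_asize_inflation);
}

/*
 * Hypothetical automatic factor. The inputs are illustrative only and
 * do not correspond to real ZFS fields:
 *   raidz_parity  0 if the pool has no RAID-Z vdevs, else 1..3
 *   copies        number of DVAs the block may be written with (1..3)
 *   dedup_on      nonzero if ddt_sync() might ditto the block
 */
uint64_t
asize_factor(int raidz_parity, int copies, int dedup_on)
{
	uint64_t factor = (uint64_t)raidz_parity + 1;

	factor *= (uint64_t)copies;
	if (dedup_on)
		factor *= 2;
	return (factor);
}
```

With these assumptions, asize_factor(VDEV_RAIDZ_MAXPARITY, SPA_DVAS_PER_BP, 1) reproduces the fixed factor of 24, while a pool without RAID-Z, a single copy, and dedup off would come out at 1.
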
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/spa_misc.c#275

--matt

On Sat, Nov 30, 2013 at 10:10 AM, Kohsuke Kawaguchi <[email protected]> wrote:

> In my quest for understanding why my TXG size is so small, I found the
> following code in ZFS on Linux. In IllumOS, the code is formatted
> differently but it's basically the same:
>
> uint64_t
> spa_get_asize(spa_t *spa, uint64_t lsize)
> {
>         /*
>          * The worst case is single-sector max-parity RAID-Z blocks, in which
>          * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
>          * times the size; so just assume that.  Add to this the fact that
>          * we can have up to 3 DVAs per bp, and one more factor of 2 because
>          * the block may be dittoed with up to 3 DVAs by ddt_sync().
>          */
>         return (lsize * (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2);
> }
>
> All in all this results in 4 * 3 * 2 = 24 times inflation of the value.
>
> If I have a 2GB quota on a file system, ZFS estimates my 100MB write to
> need 2.4GB space, which prevents them all from going into a single TXG.
> Instead, it decides to allow about 80MB (2GB/24) to go into the first TXG,
> and the rest has to wait until that first TXG is committed and ZFS realizes
> that 1.92GB (2GB-80MB) worth of space is still available.
>
> So now I'm trying to see if there's any way to improve this estimate. But
> I'm new to the ZFS codebase, and so I'm looking for some help.
>
> - I'm not using raidz, so I should be able to knock off x4 right off the
> bat. Shouldn't there be a way to tell if the pool is using raidz by looking
> at spa->spa_root_vdev and calculate multiplication factors based on vdev
> tree? (And since vdev tree shape won't change that much, hopefully
> pre-calculate this value and store it in vdev_t)
>
> - My 100MB write is write system calls to a file, and my understanding is
> that for those ZFS wouldn't replicate blocks. So I'm wasting x3, too. What
> if we pass in more contextual parameters to determine proper block
> replication factor? If so, where can I learn the block replication policy
> in ZFS?
>
> - On the last x2 factor, it appears that "ddt_sync" is dedup related? If
> so, again is there any way to tell that dedup is not on and skip this? I
> suppose this is not a property of spa but of dsl_dataset (?). Is something
> like that feasible?
>
> Any insights/thoughts into this would be highly appreciated.
>
> --
> Kohsuke Kawaguchi
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
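
The quota arithmetic in the quoted message can be checked with a small sketch. This is only a simplified model of the accounting as Kohsuke describes it, not the real reservation code; the function names are made up for illustration, and sizes are decimal (1 MB = 10^6 bytes) to match the figures in the message.

```c
#include <stdint.h>

#define MB 1000000ULL	/* decimal megabyte, matching the quoted figures */

/* Space the estimate charges for a write of lsize bytes. */
uint64_t
reserved_for_write(uint64_t lsize, uint64_t inflation)
{
	return (lsize * inflation);
}

/* Largest write that fits when only avail bytes may be charged. */
uint64_t
max_write_per_txg(uint64_t avail, uint64_t inflation)
{
	return (avail / inflation);
}
```

With an inflation of 24, reserved_for_write(100 * MB, 24) gives 2400 MB (the 2.4GB in the message), and max_write_per_txg(2000 * MB, 24) gives roughly 83 MB, which is the "about 80MB" admitted into the first TXG. With an inflation of 1 the whole 100MB write would fit under the 2GB quota at once.
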
