In my quest to understand why my TXG size is so small, I found the
following code in ZFS on Linux. In illumos, the code is formatted
differently but it's basically the same:

    uint64_t
    spa_get_asize(spa_t *spa, uint64_t lsize)
    {
        /*
         * The worst case is single-sector max-parity RAID-Z blocks, in which
         * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
         * times the size; so just assume that.  Add to this the fact that
         * we can have up to 3 DVAs per bp, and one more factor of 2 because
         * the block may be dittoed with up to 3 DVAs by ddt_sync().
         */
        return (lsize * (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2);
    }

All in all this results in 4 * 3 * 2 = 24 times inflation of the value.

If I have a 2GB quota on a file system, ZFS estimates my 100MB write to
need 2.4GB of space, which prevents it from going into a single TXG.
Instead, it decides to allow about 80MB (2GB/24) to go into the first TXG,
and the rest has to wait until that first TXG is committed and ZFS realizes
that 1.92GB (2GB-80MB) worth of space is still available.
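
To make the arithmetic above concrete, here is a minimal sketch that
reproduces the estimate outside of ZFS. The two constants are set to the
values implied by the 4 * 3 * 2 breakdown above; worst_case_asize() and
max_lsize_for() are hypothetical helper names used only for illustration,
not functions from the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

/* Values implied by the 4 * 3 * 2 = 24 breakdown quoted above. */
#define VDEV_RAIDZ_MAXPARITY	3
#define SPA_DVAS_PER_BP		3

/* Combined worst-case multiplier: (3 + 1) * 3 * 2 = 24. */
#define WORST_CASE_INFLATION \
	((uint64_t)(VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2)

/*
 * Hypothetical stand-in for spa_get_asize(): same formula, but without
 * the spa_t argument, which the quoted version does not use anyway.
 */
static uint64_t
worst_case_asize(uint64_t lsize)
{
	/* Guard against overflow of the 64-bit product. */
	assert(lsize <= UINT64_MAX / WORST_CASE_INFLATION);
	return (lsize * WORST_CASE_INFLATION);
}

/*
 * Hypothetical helper: the largest logical write whose worst-case
 * estimate still fits into 'avail' bytes of remaining space.
 */
static uint64_t
max_lsize_for(uint64_t avail)
{
	return (avail / WORST_CASE_INFLATION);
}
```

With decimal units, worst_case_asize(100000000) yields 2400000000 (the
2.4GB estimate), and max_lsize_for(2000000000) yields 83333333, i.e. the
"about 80MB" that fits into the first TXG.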

So now I'm trying to see if there's any way to improve this estimate. But
I'm new to the ZFS codebase, and so I'm looking for some help.

- I'm not using raidz, so I should be able to knock off x4 right off the
bat. Shouldn't there be a way to tell if the pool is using raidz by looking
at spa->spa_root_vdev and calculating the multiplication factor based on
the vdev tree? (And since the vdev tree shape won't change that much, we
could hopefully pre-calculate this value and store it in vdev_t.)

- My 100MB write consists of write system calls to a file, and my
understanding is that ZFS wouldn't replicate blocks for those. So I'm
wasting x3, too. What if we pass in more contextual parameters to determine
the proper block replication factor? If so, where can I learn about the
block replication policy in ZFS?

- On the last x2 factor, it appears that "ddt_sync" is dedup related? If
so, again, is there any way to tell that dedup is not on and skip this? I
suppose this is not a property of spa but of dsl_dataset (?). Is something
like that feasible?
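
To illustrate what I have in mind with the three points above, here is a
rough, self-contained sketch. Everything in it is an assumption made for
illustration: sketch_vdev_t is a simplified stand-in (not the real vdev_t),
and sketch_max_parity() and sketch_get_asize() are hypothetical names. It
also assumes the caller somehow knows the number of copies and whether
dedup is enabled, which is exactly the part I don't know how to obtain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified vdev tree node; NOT the real vdev_t. */
typedef struct sketch_vdev {
	uint64_t sv_nparity;		/* 0 if not raidz, else 1..3 */
	struct sketch_vdev **sv_child;	/* array of child pointers */
	uint64_t sv_children;		/* number of children */
} sketch_vdev_t;

/*
 * Walk the tree and return the largest parity found anywhere below
 * 'vd'. The result could be cached, since the tree rarely changes.
 */
static uint64_t
sketch_max_parity(const sketch_vdev_t *vd)
{
	uint64_t max = vd->sv_nparity;

	for (uint64_t c = 0; c < vd->sv_children; c++) {
		uint64_t p = sketch_max_parity(vd->sv_child[c]);
		if (p > max)
			max = p;
	}
	return (max);
}

/*
 * Context-aware variant of the estimate: the parity factor comes from
 * the pool layout, 'copies' (1..3) replaces the fixed 3 DVAs, and the
 * extra x2 is applied only when dedup is on.
 */
static uint64_t
sketch_get_asize(const sketch_vdev_t *root, uint64_t lsize,
    uint64_t copies, bool dedup)
{
	assert(copies >= 1 && copies <= 3);

	uint64_t factor = (sketch_max_parity(root) + 1) * copies;
	if (dedup)
		factor *= 2;
	return (lsize * factor);
}
```

For a pool with no raidz vdevs, one copy and dedup off, this estimate
collapses to lsize itself; with parity 3, three copies and dedup on, it
gives the same x24 as the current code.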

Any insights/thoughts into this would be highly appreciated.

-- 
Kohsuke Kawaguchi
