Some more specific suggestions below:

On Sat, Nov 30, 2013 at 10:10 AM, Kohsuke Kawaguchi <[email protected]> wrote:

> In my quest for understanding why my TXG size is so small, I found the
> following code in ZFS on Linux. In IllumOS, the code is formatted
> differently but it's basically the same:
>
> uint64_t
> spa_get_asize(spa_t *spa, uint64_t lsize)
> {
> /*
>  * The worst case is single-sector max-parity RAID-Z blocks, in which
>  * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
>  * times the size; so just assume that.  Add to this the fact that
>  * we can have up to 3 DVAs per bp, and one more factor of 2 because
>  * the block may be dittoed with up to 3 DVAs by ddt_sync().
>  */
> return (lsize * (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2);
> }
>
> All in all this results in 4 * 3 * 2 = 24 times inflation of the value.
>
> If I have a 2GB quota on a file system, ZFS estimates my 100MB write to
> need 2.4GB space, which prevents them all from going into a single TXG.
> Instead, it decides to allow about 80MB (2GB/24) to go into the first TXG,
> and the rest has to wait until that first TXG is committed and ZFS realizes
> that 1.92GB (2GB-80MB) worth of space is still available.
>
> So now I'm trying to see if there's any way to improve this estimate. But
> I'm new to the ZFS codebase, and so I'm looking for some help.
>
> - I'm not using raidz, so I should be able to knock off x4 right off the
> bat. Shouldn't there be a way to tell if the pool is using raidz by looking
> at spa->spa_root_vdev and calculate multiplication factors based on vdev
> tree? (And since vdev tree shape won't change that much, hopefully
> pre-calculate this value and store it in vdev_t)
>

Yes.  You will need to look at all top-level vdevs (i.e. the
spa_root_vdev->vdev_child[] array, which has vdev_children entries) and see
whether any of them use RAID-Z.  You would probably want to cache the result
in the spa_t and re-evaluate it whenever a new top-level vdev is added.
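
A rough sketch of that scan-and-cache idea follows.  It is not a patch: the
mock_* types and fields are simplified stand-ins I made up for vdev_t and
spa_t so the example compiles on its own, and the value 3 for the maximum
parity is taken from the x4 factor in your arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MOCK_RAIDZ_MAXPARITY	3	/* assumed; gives the x4 worst case */

/* Hypothetical stand-in for a top-level vdev_t. */
typedef struct mock_vdev {
	int		mv_is_raidz;	/* nonzero if this vdev is RAID-Z */
} mock_vdev_t;

/* Hypothetical stand-in for spa_t, with one new cached field. */
typedef struct mock_spa {
	mock_vdev_t	**ms_top;	/* children of the root vdev */
	uint64_t	ms_ntop;	/* number of top-level vdevs */
	int		ms_has_raidz;	/* cached result of the scan below */
} mock_spa_t;

/*
 * Recompute the cached flag.  Call this when the pool is opened and
 * again whenever a top-level vdev is added.
 */
static void
mock_spa_update_raidz(mock_spa_t *spa)
{
	assert(spa != NULL);
	spa->ms_has_raidz = 0;
	for (uint64_t c = 0; c < spa->ms_ntop; c++) {
		if (spa->ms_top[c]->mv_is_raidz) {
			spa->ms_has_raidz = 1;
			break;
		}
	}
}

/* Worst-case RAID-Z inflation: x4 if any top-level vdev is RAID-Z, else x1. */
static uint64_t
mock_spa_raidz_factor(const mock_spa_t *spa)
{
	assert(spa != NULL);
	return (spa->ms_has_raidz ? MOCK_RAIDZ_MAXPARITY + 1 : 1);
}
```

With the flag cached this way, the space estimate only has to read one field
instead of walking the vdev tree on every call.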


>
> - My 100MB write is write system calls to a file, and my understanding is
> that for those ZFS wouldn't replicate blocks. So I'm wasting x3, too. What
> if we pass in more contextual parameters to determine proper block
> replication factor? If so, where can I learn the block replication policy
> in ZFS?
>

This will be trickier, because some blocks (metadata) will be dittoed, and
others will not.  The dmu_tx_hold_* code does not currently make this
distinction.  The ditto policy is implemented in dmu_write_policy(), which
is called long after the point where you would need this information.


>
> - On the last x2 factor, it appears that "ddt_sync" is dedup related? If
> so, again is there any way to tell that dedup is not on and skip this? I
> suppose this is not a property of spa but of dsl_dataset (?). Is something
> like that feasible?
>

That's right.  You could either see if dedup is used anywhere in the pool,
or you could see if the property is set on this particular dataset.  The
dataset is readily available to all callers of spa_get_asize.  The logic
for determining if dedup is used is also in dmu_write_policy().  You would
need to replicate this logic in callers of spa_get_asize(), but you can use
a simplified version, probably just "os_dedup_checksum != ZIO_CHECKSUM_OFF".


>
> Any insights/thoughts into this would be highly appreciated.
>
> --
> Kohsuke Kawaguchi
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>