On 2018-01-02 15:51, robbieko wrote:
> Hi All,
> 
> When testing Btrfs with fio 4k random write, I found that a volume
> with less free space available has lower performance.
> 
> It seems that the smaller the free space of the volume is, the smaller
> the amount of dirty pages the filesystem can have.
> There are only 6 MB of dirty pages when the free space of the volume
> is only 10 GB, with a 16 KB nodesize and CoW disabled.
> 
> btrfs will reserve metadata for every write.
> The amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256 KB of metadata per write.
> The maximum amount of metadata reservation depends on the size of the
> metadata currently in use and on the free space within the volume
> (free chunk size / 16).
> When the reserved metadata reaches that limit, btrfs needs to flush
> data to release the reservation.
> 
> 1. Is there any logic behind the value (free chunk size / 16)?
> 
>         /*
>          * If we have dup, raid1 or raid10 then only half of the free
>          * space is actually useable. For raid56, the space info used
>          * doesn't include the parity drive, so we don't have to
>          * change the math
>          */
>         if (profile & (BTRFS_BLOCK_GROUP_DUP |
>                        BTRFS_BLOCK_GROUP_RAID1 |
>                        BTRFS_BLOCK_GROUP_RAID10))
>                 avail >>= 1;
> 
>         /*
>          * If we aren't flushing all things, let us overcommit up to
>          * 1/2th of the space. If we can flush, don't let us overcommit
>          * too much, let it overcommit up to 1/8 of the space.
>          */
>         if (flush == BTRFS_RESERVE_FLUSH_ALL)
>                 avail >>= 3;
>         else
>                 avail >>= 1;
> 
> 2. Is there any way to improve this situation?
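
For 1), the /16 falls directly out of the two shifts in the snippet you
quoted: with DUP/RAID1/RAID10 metadata (DUP being the common single-disk
default for metadata), avail is first halved, and under
BTRFS_RESERVE_FLUSH_ALL it is shifted right by 3 more, i.e.
avail / 2 / 8 = free chunk size / 16. A minimal userspace sketch of that
arithmetic (illustrative only, not the real kernel function; the names
are made up):

  #include <stdio.h>
  #include <stdint.h>

  static uint64_t overcommit_limit(uint64_t free_chunk_space,
                                   int dup_like, int flush_all)
  {
          uint64_t avail = free_chunk_space;

          if (dup_like)           /* DUP/RAID1/RAID10: half usable */
                  avail >>= 1;
          if (flush_all)          /* BTRFS_RESERVE_FLUSH_ALL */
                  avail >>= 3;    /* overcommit up to 1/8 */
          else
                  avail >>= 1;    /* overcommit up to 1/2 */
          return avail;
  }

  int main(void)
  {
          /* 10 GiB free chunk space, DUP metadata, FLUSH_ALL */
          uint64_t limit = overcommit_limit(10ULL << 30, 1, 1);

          /* (10 GiB >> 1) >> 3 == 10 GiB / 16 == 640 MiB */
          printf("overcommit limit: %llu MiB\n",
                 (unsigned long long)(limit >> 20));
          return 0;
  }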

For 2), one solution is to reduce the metadata reserve.

I experienced a similar problem, although with qgroup metadata
reservation rather than extent-level reservation.

For qgroups, we only need to care about the "net" reservation, so the
qgroup rsv can do a special calculation and get rid of the overkill
nodesize * BTRFS_MAX_LEVEL(8) * 2.

But the extent allocator must ensure that there is enough space for
delalloc, so the qgroup trick can't be applied there directly.


We can still reduce the amount a lot, though.

Personally speaking, btrfs may need at most (current_tree_level + 2) *
nodesize for each meta rsv.
(One node for a possible tree level increase, and one for an extra leaf
split.)
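
To put rough numbers on that (a userspace sketch; the 16 KB nodesize is
from the report, while the level-3 tree is just an assumption for
illustration):

  #include <stdio.h>
  #include <stdint.h>

  #define BTRFS_MAX_LEVEL 8

  int main(void)
  {
          uint64_t nodesize = 16384;  /* 16 KB, as in the report */
          int tree_level = 3;         /* assumed for illustration */

          /* current per-write rsv: nodesize * BTRFS_MAX_LEVEL * 2 */
          uint64_t cur = nodesize * BTRFS_MAX_LEVEL * 2;
          /*
           * proposed: (current_tree_level + 2) * nodesize, i.e. one
           * node per level, plus one for a possible level increase
           * and one for an extra leaf split
           */
          uint64_t prop = nodesize * (tree_level + 2);

          printf("current : %llu KiB\n",
                 (unsigned long long)(cur >> 10));   /* 256 KiB */
          printf("proposed: %llu KiB\n",
                 (unsigned long long)(prop >> 10));  /*  80 KiB */
          return 0;
  }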

And specifically for delalloc, all delalloc results (file extents) are
adjacent to each other, which could further reduce the meta reservation.

For example, if there are 4 outstanding extents to be written to disk,
we only need (current_tree_level + 2) * nodesize in total.
Even if we need to increase the tree level *AND* split a leaf, we will
definitely have enough free space to contain just 4 file extent items.
(After a rough calculation, we could keep that meta rsv until the
number of outstanding extents exceeds 307.)
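
In concrete terms, with the same assumed numbers (16 KB nodesize,
level-3 tree): those 4 outstanding extents currently reserve
4 * 256 KiB = 1 MiB, while a single shared (3 + 2) * 16 KiB = 80 KiB
rsv would still be enough to hold the 4 file extent items in the worst
case.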

So we could definitely reduce the meta rsv amount needed and thus avoid
unnecessary flushing.

Thanks,
Qu

> 
> Thanks.
> Robbie Ko
