hi,

On 10/14/2016 09:59 PM, Holger Hoffstätte wrote:
On 10/06/16 04:51, Wang Xiaoguang wrote:
When testing btrfs compression, we sometimes got ENOSPC errors even though the fs
still had plenty of free space. With compression enabled, xfstests generic/171,
generic/172, generic/173, generic/174 and generic/175 can reveal this bug in my
test environment.

After some debugging work, we found that it is btrfs_delalloc_reserve_metadata()
which sometimes tries to reserve far too much metadata space, even for a very small
data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes to
reserve is calculated from the difference between outstanding_extents and
reserved_extents. The following case shows how ENOSPC occurs:

   1, Buffered write of 128MB of data in 128KB units, so we finally end up with 1
outstanding extent for the inode and reserved_extents at 1024 (128MB / 128KB = 1024
reservations). Note that it is btrfs_merge_extent_hook() that merges these 128KB
units into one big outstanding extent, but it does not change reserved_extents.

   2, When writing out dirty pages, for compression, cow_file_range_async() will
split the big extent above into 128KB units (the compressed extent size is 128KB).
When the first split operation finishes, we'll have 2 outstanding extents and 1024
reserved extents. If right at this point the just-generated ordered extent is
dispatched to run and completes, btrfs_delalloc_release_metadata() (see
btrfs_finish_ordered_io()) is called to release metadata, after which we will have
1 outstanding extent and 1 reserved extent (also see the logic in
drop_outstanding_extent(); a simplified sketch follows this list). Later
cow_file_range_async() continues to handle the remaining data range [128KB, 128MB),
and if no other ordered extent gets dispatched to run, we end up with 1023
outstanding extents and 1 reserved extent.

   3, Now if another buffered write to this file comes in, then
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least 1023
outstanding extents. For a 16KB node size that is 1023*16384*2*8 bytes, about
255MB; for a 64KB node size it is 1023*65536*2*8 bytes, about 1GB of metadata (see
the arithmetic sketch after this list). That is obviously not sane and can easily
result in an ENOSPC error.
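For step 2, the key piece of bookkeeping is that the release path pulls
reserved_extents back down to whatever is still outstanding, which is how 1024
reserved extents collapse to 1. A heavily simplified sketch of that logic (a
paraphrase of what drop_outstanding_extent() does, not the actual kernel code):

        /*
         * Heavily simplified paraphrase of the bookkeeping in
         * drop_outstanding_extent(); illustration only.
         */
        static unsigned int drop_outstanding_sketch(unsigned int *outstanding,
                                                    unsigned int *reserved,
                                                    unsigned int finished)
        {
                unsigned int dropped = 0;

                *outstanding -= finished;       /* ordered extent completed */
                if (*reserved > *outstanding) {
                        /* give back every reservation we no longer need */
                        dropped = *reserved - *outstanding;
                        *reserved = *outstanding;
                }
                return dropped;
        }

        /*
         * Step 2 above: outstanding = 2, reserved = 1024, finished = 1
         *   -> outstanding = 1, reserved = 1, dropped = 1023
         */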
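For step 3, the byte counts quoted above follow from the per-extent worst-case cost
of nodesize * 2 * 8 (8 being BTRFS_MAX_LEVEL; the same shape as
btrfs_calc_trans_metadata_size()). A quick arithmetic check, illustration only:

        /* Quick check of the worst-case reservation numbers in step 3. */
        #define MAX_LEVEL 8     /* BTRFS_MAX_LEVEL */

        static u64 worst_case_metadata_bytes(u64 nodesize, u64 nr_extents)
        {
                return nodesize * 2 * MAX_LEVEL * nr_extents;
        }

        /*
         * worst_case_metadata_bytes(16384, 1023) == 268173312  (~255MB)
         * worst_case_metadata_bytes(65536, 1023) == 1072693248 (~1GB)
         */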

The root cause is that for compression the max extent size is no longer
BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata reservation
method in btrfs is not appropriate or correct. Here we introduce:
        enum btrfs_metadata_reserve_type {
                BTRFS_RESERVE_NORMAL,
                BTRFS_RESERVE_COMPRESS,
        };
and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
by adding a new enum btrfs_metadata_reserve_type argument. When a data range will
go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metadata.
Meanwhile we introduce an EXTENT_COMPRESS flag to mark a data range that will go
through the compression path.
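Roughly speaking, the reserve type tells the reservation code which maximum extent
size to assume when counting outstanding extents for a range. A minimal sketch of
that selection (the helper name here is illustrative, not the patch's actual code):

        /*
         * Illustrative only: the maximum extent size the reservation code
         * should assume for a delalloc range.  Compressed ranges are split
         * into 128KB extents, so they must be counted with the smaller unit.
         */
        static inline u64 reserve_max_extent_size(enum btrfs_metadata_reserve_type type)
        {
                if (type == BTRFS_RESERVE_COMPRESS)
                        return SZ_128K;                 /* compressed extent size */
                return BTRFS_MAX_EXTENT_SIZE;           /* 128MB, normal path */
        }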

With this patch, we can fix these false ENOSPC errors for compression.

Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
I took some time again to get this into my tree on top of what's in
btrfs-4.9rc1 and managed to merge it after all.

Both this and patch #1 seem to work fine, and they don't seem to cause any
regressions; I ran a couple of full and incremental rsync backups with 100GB on a
new and now compressed subvolume without problems. Also Stefan just reported that
his ENOSPC seems to be gone as well, so it seems to be good. \o/

So for both this and patch #1, have a careful:

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

Also a comment about something I found while resolving conflicts caused
by the preliminary dedupe support:

[..]
  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
-                             struct extent_state **cached_state);
+                             struct extent_state **cached_state, int flag);
  int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
-                           struct extent_state **cached_state);
+                           struct extent_state **cached_state, int flag);
[..]
  int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
                      struct page **pages, size_t num_pages,
                      loff_t pos, size_t write_bytes,
-                     struct extent_state **cached);
+                     struct extent_state **cached, int flag);
Instead of adding "int flag", why not use the already defined
btrfs_metadata_reserve_type enum? I know it's just an int at the end of
the day, but the dedupe support already added another "int dedupe" argument,
and it's easy to cause confusion.
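Something along these lines, perhaps (just a sketch of what the prototypes could
look like, not the actual patch):

  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
                                struct extent_state **cached_state,
                                enum btrfs_metadata_reserve_type reserve_type);
  int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
                              struct extent_state **cached_state,
                              enum btrfs_metadata_reserve_type reserve_type);
  int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
                        struct page **pages, size_t num_pages,
                        loff_t pos, size_t write_bytes,
                        struct extent_state **cached,
                        enum btrfs_metadata_reserve_type reserve_type);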
Maybe later it would be beneficial to consolidate the flags into a consistent
set of enum values to prevent more "int flag" inflation and better declare the
intent of the extent state change. Not sure if that makes sense.
Yes, agree.
I'll rebase them later, thanks.

Regards,
Xiaoguang Wang


thanks,
Holger





