On Mon, Feb 27, 2017 at 4:14 PM, Filipe Manana <[email protected]> wrote:
Also, I forgot to mention before: looking at the subject, the term
deadlock is not correct, as it's a hang. A deadlock is when a task is
trying to acquire a resource (typically a lock) that is already held
by some other task, which in turn is blocked trying to acquire a
resource held by the former task.

> On Fri, Feb 24, 2017 at 2:06 AM, Qu Wenruo <[email protected]> wrote:
>> If run btrfs/125 with nospace_cache or space_cache=v2 mount option,
>> btrfs will block with the following backtrace:
>
> Happens with btrfs/124 without any mount options too.
>
>>
>> Call Trace:
>>  __schedule+0x2d4/0xae0
>>  schedule+0x3d/0x90
>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>  ? wake_atomic_t_function+0x60/0x60
>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>  process_one_work+0x2af/0x720
>>  ? process_one_work+0x22b/0x720
>>  worker_thread+0x4b/0x4f0
>>  kthread+0x10f/0x150
>>  ? process_one_work+0x720/0x720
>>  ? kthread_create_on_node+0x40/0x40
>>  ret_from_fork+0x2e/0x40
>>
>> The direct cause is the error handler in run_delalloc_nocow() doesn't
>> handle error from btrfs_reloc_clone_csums() well.
>>
>> The error handler of run_delalloc_nocow() will clear dirty and finish IO
>> for the pages in that extent.
>> However we have already inserted one ordered extent.
>> And that ordered extent is relying on endio hooks to wait all its pages
>> to finish, while only the first page will finish.
>>
>> This makes that ordered extent never finish, so blocking the file
>> system.
>>
>> Although the root cause is still in RAID5/6, it won't hurt to fix the
>> error routine first.
>
> This is a higher level problem independent of the raid level being
> used. So this changelog gives the wrong idea that this happens only
> with raid5/6.
>
>>
>> This patch will slightly modify one existing function,
>> btrfs_endio_direct_write_update_ordered() to handle free space inode,
>> and skip releasing metadata, which will be handled by
>> extent_clear_unlock_delalloc().
>>
>> And use it as base to implement one inline function,
>> btrfs_cleanup_ordered_extents() to handle the error in
>> run_delalloc_nocow() and cow_file_range().
>>
>> Also, extent_clear_unlock_delalloc() will handle all the metadata
>> release, so btrfs_cleanup_ordered_extents() doesn't need to do it.
>>
>> For compression, it's calling writepage_end_io_hook() itself to handle
>> its error, and any submitted ordered extent will have its bio submitted,
>> so no need to worry about compression part.
>>
>> Suggested-by: Filipe Manana <[email protected]>
>> Signed-off-by: Qu Wenruo <[email protected]>
>> ---
>> v2:
>>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>>   outstanding extents, which is already done by
>>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
>> v3:
>>   Skip first page to avoid underflow ordered->bytes_left.
>>   Fix range passed in cow_file_range() which doesn't cover the whole
>>   extent.
>>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>>   metadata release.
>> v4:
>>   Don't use extra bit to skip metadata freeing for ordered extent,
>>   but only handle btrfs_reloc_clone_csums() error just before processing
>>   next extent.
>>   This makes error handle much easier for run_delalloc_nocow().
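
To make the hang mechanism (and the ordered->bytes_left underflow that
the v3 changelog mentions) easier to see, here is a tiny user-space
model of the accounting involved. The toy_* names are made up for
illustration and none of this is btrfs code: an ordered extent
completes only once end-io has accounted for every byte of its range,
so cleaning pages without that accounting leaves a waiter blocked
forever, and a cleanup helper has to skip the first page, which the
writepage error path already finished.

/*
 * Toy user-space model of the hang described above; toy_* names are
 * made up for illustration, this is not kernel code. An "ordered
 * extent" tracks bytes_left and only completes once end-io accounting
 * has covered its whole range. If the error path cleans pages without
 * that accounting, bytes_left never reaches zero and a waiter blocks
 * forever. A cleanup helper must therefore account for the remaining
 * range itself, skipping the first page (already finished by the
 * writepage error path) to avoid underflowing bytes_left.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_PAGE_SIZE 4096UL

struct toy_ordered_extent {
        unsigned long bytes_left;       /* bytes not yet seen by end-io */
};

/* End-io accounting for one chunk of the ordered extent's range. */
static void toy_endio(struct toy_ordered_extent *oe, unsigned long bytes)
{
        assert(oe->bytes_left >= bytes);        /* would underflow otherwise */
        oe->bytes_left -= bytes;
}

static bool toy_ordered_complete(const struct toy_ordered_extent *oe)
{
        return oe->bytes_left == 0;
}

/* Error cleanup: account for everything except the first page. */
static void toy_cleanup_ordered(struct toy_ordered_extent *oe,
                                unsigned long len)
{
        toy_endio(oe, len - TOY_PAGE_SIZE);
}

int main(void)
{
        unsigned long len = 4 * TOY_PAGE_SIZE;
        struct toy_ordered_extent oe = { .bytes_left = len };

        /* The writepage error path only finishes the first (locked) page. */
        toy_endio(&oe, TOY_PAGE_SIZE);

        /* Without a cleanup call the extent never completes: a waiter hangs. */
        printf("complete before cleanup: %d\n", toy_ordered_complete(&oe));

        /* Cleanup skips the first page, so the range is accounted exactly once. */
        toy_cleanup_ordered(&oe, len);
        printf("complete after cleanup:  %d\n", toy_ordered_complete(&oe));
        return 0;
}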
>> ---
>>  fs/btrfs/extent_io.c |   1 -
>>  fs/btrfs/inode.c     | 112 +++++++++++++++++++++++++++++++++++++++++++--------
>>  2 files changed, 96 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 4ac383a3a649..a14d1b0840c5 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3258,7 +3258,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>>                                                delalloc_end,
>>                                                &page_started,
>>                                                nr_written);
>> -               /* File system has been set read-only */
>
> This is unrelated to the problem you're solving. Should be an
> independent and unrelated patch.
>
>>                 if (ret) {
>>                         SetPageError(page);
>>                         /* fill_delalloc should be return < 0 for error
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 1e861a063721..b98a92807aa2 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -116,6 +116,34 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
>>
>>  static int btrfs_dirty_inode(struct inode *inode);
>>
>> +
>> +static void __endio_write_update_ordered(struct inode *inode,
>> +                                        const u64 offset, const u64 bytes,
>> +                                        bool uptodate, bool cleanup);
>
> Offset and bytes are const, why not the booleans as well? It's
> inconsistent and confusing.
> Also, leaving a blank line before the declaration of the next function
> would make the code easier on the eyes.
>
>> +static inline void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>> +                                                           const u64 offset,
>> +                                                           const u64 bytes,
>> +                                                           const int uptodate)
>
> Now uptodate is an int and not bool as in the previous call that it
> ends up calling. At least within the same patch we should be
> consistent.
>
>> +{
>> +       return __endio_write_update_ordered(inode, offset, bytes, uptodate, false);
>> +}
>> +
>> +/*
>> + * Cleanup all submitted ordered extent in specified range to handle error
>
> extent -> extents
>
>> + * in cow_file_range() and run_delalloc_nocow().
>> + * Compression handles error and ordered extent submission all by themselves,
>
> all by themselves -> by itself
>
>> + * so no need to call this function.
>> + *
>> + * NOTE: caller must ensure extent_clear_unlock_delalloc() in error handler
>> + * doesn't cover any range of submitted ordered extent.
>> + * Or we will double free metadata for submitted ordered extent.
>> + */
>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> +                                                u64 offset, u64 bytes)
>> +{
>> +       return __endio_write_update_ordered(inode, offset, bytes, false, true);
>> +}
>> +
>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>  void btrfs_test_inode_set_ops(struct inode *inode)
>>  {
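
For what it's worth, here is a shape-only sketch of the consistency
being asked for, using stand-in types rather than the real btrfs ones
(so not a drop-in for fs/btrfs/inode.c): bool for both flags, the same
const qualifiers on every parameter, and a blank line between the
declarations.

/*
 * Shape-only sketch with stand-in types; not the actual fs/btrfs
 * prototypes, only the suggested consistency.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
struct inode;                           /* opaque stand-in */

static void __endio_write_update_ordered(struct inode *inode,
                                         const u64 offset, const u64 bytes,
                                         const bool uptodate,
                                         const bool cleanup)
{
        /* stub: the real worker completes the ordered extents in range */
        (void)inode; (void)offset; (void)bytes; (void)uptodate; (void)cleanup;
}

static inline void btrfs_endio_direct_write_update_ordered(struct inode *inode,
                                                           const u64 offset,
                                                           const u64 bytes,
                                                           const bool uptodate)
{
        __endio_write_update_ordered(inode, offset, bytes, uptodate, false);
}

static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
                                                 const u64 offset,
                                                 const u64 bytes)
{
        __endio_write_update_ordered(inode, offset, bytes, false, true);
}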
>> @@ -950,6 +978,7 @@ static noinline int cow_file_range(struct inode *inode,
>>         u64 disk_num_bytes;
>>         u64 cur_alloc_size;
>>         u64 blocksize = fs_info->sectorsize;
>> +       u64 orig_start = start;
>>         struct btrfs_key ins;
>>         struct extent_map *em;
>>         struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>> @@ -1052,15 +1081,22 @@ static noinline int cow_file_range(struct inode *inode,
>>                     BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>                         ret = btrfs_reloc_clone_csums(inode, start,
>>                                                       cur_alloc_size);
>> +                       /*
>> +                        * Only drop cache here, and process as normal.
>> +                        *
>> +                        * We must not allow extent_clear_unlock_delalloc()
>> +                        * to free meta of this ordered extent, as its
>> +                        * meta should be freed by btrfs_finish_ordered_io().
>> +                        *
>> +                        * So we must continue until processing next extent
>> +                        */
>>                         if (ret)
>> -                               goto out_drop_extent_cache;
>> +                               btrfs_drop_extent_cache(inode, start,
>> +                                               start + ram_size - 1, 0);
>>                 }
>>
>>                 btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>>
>> -               if (disk_num_bytes < cur_alloc_size)
>> -                       break;
>> -
>
> If I understand correctly, you're removing this just to make sure we
> call extent_clear_unlock_delalloc() below.
> However later on, we should still break if this condition is true,
> otherwise this code is not equivalent to what we had before.
>
>>                 /* we're not doing compressed IO, don't unlock the first
>>                  * page (which the caller expects to stay locked), don't
>>                  * clear any dirty bits and don't set any writeback bits
>> @@ -1076,10 +1112,21 @@ static noinline int cow_file_range(struct inode *inode,
>>                                              delalloc_end, locked_page,
>>                                              EXTENT_LOCKED | EXTENT_DELALLOC,
>>                                              op);
>> -               disk_num_bytes -= cur_alloc_size;
>> +               if (disk_num_bytes < cur_alloc_size)
>> +                       disk_num_bytes = 0;
>> +               else
>> +                       disk_num_bytes -= cur_alloc_size;
>>                 num_bytes -= cur_alloc_size;
>>                 alloc_hint = ins.objectid + ins.offset;
>>                 start += cur_alloc_size;
>> +
>> +               /*
>> +                * btrfs_reloc_clone_csums() error, now
>> +                * extent_clear_unlock_delalloc() in out_unlock() won't
>> +                * double free metadata of current oredered extent.
>
> in out_unlock() -> at the out_unlock label
> (it's not a function, so it should not have the parentheses following
> the name)
>
> oredered -> ordered
>
>> +                */
>> +               if (ret)
>> +                       goto out_reserve;
>
> Jumping to this label will make us decrement block group reservations
> twice - i.e. btrfs_dec_block_group_reservations() is now called twice
> instead of once, allowing for an underflow to happen.
>
> And btrfs_free_reserved_extent() will be called twice for the reserved
> extent, once at the "out_reserve" label and once when the ordered
> extent finishes (with the error bit set), at
> btrfs_finish_ordered_io(). This is particularly dangerous, as we are
> in a transaction-less context, so that means once the first free call
> happens, the same space can be allocated elsewhere by another task and
> then when the ordered extent finishes we incorrectly free that space
> again.
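
To illustrate why that second free is dangerous, here is a toy
user-space allocator (made-up names, not btrfs code): if the same
extent is released twice and another allocation happens in between,
the allocator ends up marking space that is still in use as free.

/*
 * Toy user-space model of the double-free hazard described above; the
 * names are invented and this is not btrfs code, only the accounting
 * idea.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_BLOCKS 4

static bool block_free[NR_BLOCKS] = { true, true, true, true };

static int alloc_block(void)
{
        for (int i = 0; i < NR_BLOCKS; i++) {
                if (block_free[i]) {
                        block_free[i] = false;
                        return i;
                }
        }
        return -1;
}

static void free_block(int i)
{
        if (block_free[i])
                printf("BUG: double free of block %d\n", i);
        block_free[i] = true;
}

int main(void)
{
        int a = alloc_block();  /* extent reserved for the ordered extent */

        free_block(a);          /* freed once at the error label */
        int b = alloc_block();  /* another task grabs the same space */
        free_block(a);          /* freed again when the ordered extent finishes */

        /* Block b is still in use, but the allocator now thinks it is free. */
        printf("b=%d, block_free[b]=%d  <- in-use block marked free\n",
               b, block_free[b]);
        return 0;
}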
>
>>         }
>>  out:
>>         return ret;
>> @@ -1096,6 +1143,7 @@ static noinline int cow_file_range(struct inode *inode,
>>                                      EXTENT_DELALLOC | EXTENT_DEFRAG,
>>                                      PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
>>                                      PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
>> +       btrfs_cleanup_ordered_extents(inode, orig_start, end - orig_start + 1);
>>         goto out;
>>  }
>>
>> @@ -1496,15 +1544,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>                         BUG_ON(ret); /* -ENOMEM */
>>
>>                         if (root->root_key.objectid ==
>> -                           BTRFS_DATA_RELOC_TREE_OBJECTID) {
>> +                           BTRFS_DATA_RELOC_TREE_OBJECTID)
>> +                               /*
>> +                                * Error handled later, as we must prevent
>> +                                * extent_clear_unlock_delalloc() in error handler
>> +                                * from freeing metadata of submitted ordered extent.
>> +                                */
>>                                 ret = btrfs_reloc_clone_csums(inode, cur_offset,
>>                                                               num_bytes);
>> -                       if (ret) {
>> -                               if (!nolock && nocow)
>> -                                       btrfs_end_write_no_snapshoting(root);
>> -                               goto error;
>> -                       }
>> -               }
>>
>>                 extent_clear_unlock_delalloc(inode, cur_offset,
>>                                              cur_offset + num_bytes - 1, end,
>> @@ -1516,6 +1563,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>                 if (!nolock && nocow)
>>                         btrfs_end_write_no_snapshoting(root);
>>                 cur_offset = extent_end;
>> +
>> +               /*
>> +                * btrfs_reloc_clone_csums() error, now we're OK to call error
>> +                * handler, as metadata for submitted ordered extent will only
>> +                * be freed by btrfs_finish_ordered_io().
>> +                */
>> +               if (ret)
>> +                       goto error;
>>                 if (cur_offset > end)
>>                         break;
>>         }
>> @@ -1546,6 +1601,12 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>                                              PAGE_CLEAR_DIRTY |
>>                                              PAGE_SET_WRITEBACK |
>>                                              PAGE_END_WRITEBACK);
>> +       /*
>> +        * It's possible that last ordered extent covered the last part
>> +        * but failed. In that case we still need to clean them up.
>> +        */
>
> Confusing comment. You mention ordered extent (singular) but then
> later mention "clean them up" (plural). Also saying that an ordered
> extent failed gives the idea that the IO failed (i.e. bio finished
> with an error), while what you want to say is that we created one or
> more ordered extents but failed somewhere before submitting bios for
> them. I would just remove the comment, or at the very least make it
> understandable.
>
>> +       if (ret)
>> +               btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>>         btrfs_free_path(path);
>>         return ret;
>>  }
>> @@ -8185,17 +8246,36 @@ static void btrfs_endio_direct_read(struct bio *bio)
>>         bio_put(bio);
>>  }
>>
>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>> -                                                    const u64 offset,
>> -                                                    const u64 bytes,
>> -                                                    const int uptodate)
>> +static void __endio_write_update_ordered(struct inode *inode,
>> +                                        const u64 offset, const u64 bytes,
>> +                                        bool uptodate, bool cleanup)
>>  {
>>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>         struct btrfs_ordered_extent *ordered = NULL;
>> +       struct btrfs_workqueue *wq;
>> +       btrfs_work_func_t func;
>>         u64 ordered_offset = offset;
>>         u64 ordered_bytes = bytes;
>>         int ret;
>>
>> +       if (btrfs_is_free_space_inode(inode)) {
>> +               wq = fs_info->endio_freespace_worker;
>> +               func = btrfs_freespace_write_helper;
>> +       } else {
>> +               wq = fs_info->endio_write_workers;
>> +               func = btrfs_endio_write_helper;
>> +       }
>
> What is going on here? wq and func are set but not used anywhere.
>
>> +
>> +       /*
>> +        * In cleanup case, the first page of the range will be handled
>> +        * by end_extent_writepage() under done tag of __extent_writepage().
>
> This sentence is confusing, "under done tag"?
> Just say:
>
> In cleanup case, the first page of the range will be handled by
> end_extent_writepage() when called from __extent_writepage().
>
> thanks
>
>> +        *
>> +        * So we must skip first page, or we will underflow ordered->bytes_left
>> +        */
>> +       if (cleanup) {
>> +               ordered_offset += PAGE_SIZE;
>> +               ordered_bytes -= PAGE_SIZE;
>> +       }
>>  again:
>>         ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>>                                                    &ordered_offset,
>> --
>> 2.11.1
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
