On Mon, Sep 22, 2014 at 05:41:05PM +0100, Filipe Manana wrote: > While we have a transaction ongoing, the VM might decide at any time > to call btree_inode->i_mapping->a_ops->writepages(), which will start > writeback of dirty pages belonging to btree nodes/leafs. This call > might return an error or the writeback might finish with an error > before we attempt to commit the running transaction. If this happens, > we might have no way of knowing that such error happened when we are > committing the transaction - because the pages might no longer be > marked dirty nor tagged for writeback (if a subsequent modification > to the extent buffer didn't happen before the transaction commit) which > makes filemap_fdata[write|wait]_range unable to find such pages (even > if they're marked with SetPageError). > So if this happens we must abort the transaction, otherwise we commit > a super block with btree roots that point to btree nodes/leafs whose > content on disk is invalid - either garbage or the content of some > node/leaf from a past generation that got cowed or deleted and is no > longer valid (for this later case we end up getting error messages like > "parent transid verify failed on 10826481664 wanted 25748 found 29562" > when reading btree nodes/leafs from disk).
Good catch! > > Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's > i_mapping would not be enough because we need to distinguish between > log tree extents (not fatal) vs non-log tree extents (fatal) and > because the next call to filemap_fdatawait_range() will catch and clear > such errors in the mapping - and that call might be from a log sync and > not from a transaction commit, which means we would not know about the > error at transaction commit time. Also, checking for the eb flag > EXTENT_BUFFER_IOERR at transaction commit time isn't done and would > not be completely reliable, as the eb might be removed from memory and > read back when trying to get it, which clears that flag right before > reading the eb's pages from disk, making us not know about the previous > write error. > > Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR > inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when > writepages() returns success, started writeback for all dirty pages > and before filemap_fdatawait_range() is called, the writeback for > all dirty pages had already finished with errors - because we were > not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return > success, as it could not know that writeback errors happened (the > pages were no longer tagged for writeback). But there is a problem when the extent buffer with IO error is COWed and deleted, but the BTRFS_INODE_BTREE_IO_ERR flag still remains in btree_inode's runtime_flags, we'd still run into cleanup_transaction() as this patch expects. So I think we need to remove that BTREE_IO_ERR flag in this case. Well, I notice that you don't clear rest pages' DIRTY in the error case, so it can lead to hitting the BUG_ON in btrfs_release_extent_buffer_page() when the eb with IO error is COWed by a new good eb and gets itself freed then. I'm making a patch to add missing clear_page_dirty_for_io(). thanks, -liubo > > Signed-off-by: Filipe Manana <fdman...@suse.com> > --- > fs/btrfs/btrfs_inode.h | 2 ++ > fs/btrfs/extent_io.c | 69 > +++++++++++++++++++++++++++++++++++++++++++++++--- > fs/btrfs/transaction.c | 20 ++++++++++++--- > fs/btrfs/transaction.h | 3 +-- > fs/btrfs/tree-log.c | 13 ++++++---- > 5 files changed, 93 insertions(+), 14 deletions(-) > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h > index 3511031..dbe37dc 100644 > --- a/fs/btrfs/btrfs_inode.h > +++ b/fs/btrfs/btrfs_inode.h > @@ -44,6 +44,8 @@ > #define BTRFS_INODE_IN_DELALLOC_LIST 9 > #define BTRFS_INODE_READDIO_NEED_LOCK 10 > #define BTRFS_INODE_HAS_PROPS 11 > +#define BTRFS_INODE_BTREE_IO_ERR 12 > +#define BTRFS_INODE_BTREE_LOG_IO_ERR 13 > > /* in memory btrfs inode */ > struct btrfs_inode { > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 91f866c..33b113b 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -20,6 +20,7 @@ > #include "locking.h" > #include "rcu-string.h" > #include "backref.h" > +#include "transaction.h" > > static struct kmem_cache *extent_state_cache; > static struct kmem_cache *extent_buffer_cache; > @@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct > extent_buffer *eb) > wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK); > } > > +static void set_btree_ioerr(struct page *page, int err) > +{ > + struct extent_buffer *eb = (struct extent_buffer *)page->private; > + const u64 start = eb->start; > + const u64 end = eb->start + eb->len - 1; > + struct btrfs_fs_info *fs_info = eb->fs_info; > + int ret; > + > + set_bit(EXTENT_BUFFER_IOERR, &eb->bflags); > + SetPageError(page); > + > + /* > + * If writeback for a btree extent that doesn't belong to a log tree > + * failed, set the bit BTRFS_INODE_BTREE_IO_ERR in the inode btree. > + * We do this because while the transaction is running and before it's > + * committing (when we call filemap_fdata[write|wait]_range against > + * the btree inode), we might have > + * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it > + * returns an error or an error happens during writeback, when we're > + * committing the transaction we wouldn't know about it, since the pages > + * can be no longer dirty nor marked anymore for writeback (if a > + * subsequent modification to the extent buffer didn't happen before the > + * transaction commit), which makes filemap_fdata[write|wait]_range not > + * able to find the pages tagged with SetPageError at transaction > + * commit time. So if this happens we must abort the transaction, > + * otherwise we commit a super block with btree roots that point to > + * btree nodes/leafs whose content on disk is invalid - either garbage > + * or the content of some node/leaf from a past generation that got > + * cowed or deleted and is no longer valid. > + * > + * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would > + * not be enough - we need to distinguish between log tree extents vs > + * non-log tree extents, and the next filemap_fdatawait_range() call > + * will catch and clear such errors in the mapping - and that call might > + * be from a log sync and not from a transaction commit. Also, checking > + * for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't > + * done and would not be completely reliable, as the eb might be removed > + * from memory and read back when trying to get it, which clears that > + * flag right before reading the eb's pages from disk, making us not > + * know about the previous write error. > + * > + * Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR > + * inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when > + * writepages() returns success, started writeback for all dirty pages > + * and before filemap_fdatawait_range() is called, the writeback for > + * all dirty pages had already finished with errors - because we were > + * not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return > + * success, as it could not know that writeback errors happened (the > + * pages were no longer tagged for writeback). > + */ > + ASSERT(fs_info->running_transaction); > + ret = test_range_bit(&fs_info->running_transaction->dirty_pages, > + start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY, > + 1, NULL); > + if (ret) > + set_bit(BTRFS_INODE_BTREE_IO_ERR, > + &BTRFS_I(fs_info->btree_inode)->runtime_flags); > + else > + set_bit(BTRFS_INODE_BTREE_LOG_IO_ERR, > + &BTRFS_I(fs_info->btree_inode)->runtime_flags); > +} > + > static void end_bio_extent_buffer_writepage(struct bio *bio, int err) > { > struct bio_vec *bvec; > @@ -3620,9 +3683,8 @@ static void end_bio_extent_buffer_writepage(struct bio > *bio, int err) > done = atomic_dec_and_test(&eb->io_pages); > > if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) { > - set_bit(EXTENT_BUFFER_IOERR, &eb->bflags); > ClearPageUptodate(page); > - SetPageError(page); > + set_btree_ioerr(page, err < 0 ? err : -EIO); > } > > end_page_writeback(page); > @@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct > extent_buffer *eb, > 0, epd->bio_flags, bio_flags); > epd->bio_flags = bio_flags; > if (ret) { > - set_bit(EXTENT_BUFFER_IOERR, &eb->bflags); > - SetPageError(p); > + set_btree_ioerr(p, ret); > end_page_writeback(p); > if (atomic_sub_and_test(num_pages - i, &eb->io_pages)) > end_extent_buffer_writeback(eb); > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 1e272c0..f17829a 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -844,6 +844,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root, > * on all the pages and clear them from the dirty pages state tree > */ > int btrfs_wait_marked_extents(struct btrfs_root *root, > + struct btrfs_trans_handle *trans, > struct extent_io_tree *dirty_pages, int mark) > { > int err = 0; > @@ -852,6 +853,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root, > struct extent_state *cached_state = NULL; > u64 start = 0; > u64 end; > + struct inode *btree_inode = root->fs_info->btree_inode; > > while (!find_first_extent_bit(dirty_pages, start, &start, &end, > EXTENT_NEED_WAIT, &cached_state)) { > @@ -865,6 +867,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root, > } > if (err) > werr = err; > + > + if (dirty_pages == &trans->transaction->dirty_pages) { > + if (test_and_clear_bit(BTRFS_INODE_BTREE_IO_ERR, > + &BTRFS_I(btree_inode)->runtime_flags)) > + werr = werr ? werr : -EIO; > + } else { > + if (test_and_clear_bit(BTRFS_INODE_BTREE_LOG_IO_ERR, > + &BTRFS_I(btree_inode)->runtime_flags)) > + werr = werr ? werr : -EIO; > + } > + > return werr; > } > > @@ -874,6 +887,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root, > * those extents are on disk for transaction or log commit > */ > static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root, > + struct btrfs_trans_handle *trans, > struct extent_io_tree *dirty_pages, int mark) > { > int ret; > @@ -883,7 +897,7 @@ static int btrfs_write_and_wait_marked_extents(struct > btrfs_root *root, > blk_start_plug(&plug); > ret = btrfs_write_marked_extents(root, dirty_pages, mark); > blk_finish_plug(&plug); > - ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark); > + ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark); > > if (ret) > return ret; > @@ -892,7 +906,7 @@ static int btrfs_write_and_wait_marked_extents(struct > btrfs_root *root, > return 0; > } > > -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans, > +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans, > struct btrfs_root *root) > { > if (!trans || !trans->transaction) { > @@ -900,7 +914,7 @@ int btrfs_write_and_wait_transaction(struct > btrfs_trans_handle *trans, > btree_inode = root->fs_info->btree_inode; > return filemap_write_and_wait(btree_inode->i_mapping); > } > - return btrfs_write_and_wait_marked_extents(root, > + return btrfs_write_and_wait_marked_extents(root, trans, > &trans->transaction->dirty_pages, > EXTENT_DIRTY); > } > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h > index 7dd558e..78b754a 100644 > --- a/fs/btrfs/transaction.h > +++ b/fs/btrfs/transaction.h > @@ -146,8 +146,6 @@ struct btrfs_trans_handle > *btrfs_attach_transaction_barrier( > struct btrfs_root *root); > struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root > *root); > int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid); > -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans, > - struct btrfs_root *root); > > void btrfs_add_dead_root(struct btrfs_root *root); > int btrfs_defrag_root(struct btrfs_root *root); > @@ -167,6 +165,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle > *trans, > int btrfs_write_marked_extents(struct btrfs_root *root, > struct extent_io_tree *dirty_pages, int mark); > int btrfs_wait_marked_extents(struct btrfs_root *root, > + struct btrfs_trans_handle *trans, > struct extent_io_tree *dirty_pages, int mark); > int btrfs_transaction_blocked(struct btrfs_fs_info *info); > int btrfs_transaction_in_commit(struct btrfs_fs_info *info); > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index 2d0fa43..22ffd32 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > mutex_unlock(&log_root_tree->log_mutex); > goto out; > } > - btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark); > + btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, > + mark); > btrfs_free_logged_extents(log, log_transid); > mutex_unlock(&log_root_tree->log_mutex); > ret = -EAGAIN; > @@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > index2 = root_log_ctx.log_transid % 2; > if (atomic_read(&log_root_tree->log_commit[index2])) { > blk_finish_plug(&plug); > - btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark); > + btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, > + mark); > wait_log_commit(trans, log_root_tree, > root_log_ctx.log_transid); > btrfs_free_logged_extents(log, log_transid); > @@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > */ > if (btrfs_need_log_full_commit(root->fs_info, trans)) { > blk_finish_plug(&plug); > - btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark); > + btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, > + mark); > btrfs_free_logged_extents(log, log_transid); > mutex_unlock(&log_root_tree->log_mutex); > ret = -EAGAIN; > @@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > mutex_unlock(&log_root_tree->log_mutex); > goto out_wake_log_root; > } > - btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark); > - btrfs_wait_marked_extents(log_root_tree, > + btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark); > + btrfs_wait_marked_extents(log_root_tree, trans, > &log_root_tree->dirty_log_pages, > EXTENT_NEW | EXTENT_DIRTY); > btrfs_wait_logged_extents(log, log_transid); > -- > 1.9.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html