Hi Jun,

On 2018/1/27 16:28, piaojun wrote:
> We should not reuse the dirty bh in jbd2 directly due to the following
> situation:
> 
> 1. When removing extent rec, we will dirty the bhs of extent rec and
>     truncate log at the same time, and hand them over to jbd2.
> 2. The bhs are submitted to jbd2 area successfully.
> 3. The device's write-back thread tries to flush the bhs to disk but
>     encounters a write error due to an abnormal storage link.
> 4. After a while the storage link becomes normal again. The truncate log
>     flush worker, triggered by the next space reclaim, finds the dirty bh
>     of the truncate log, clears its 'BH_Write_EIO' flag and then sets it
>     uptodate in __ocfs2_journal_access():
> 
> ocfs2_truncate_log_worker
>    ocfs2_flush_truncate_log
>      __ocfs2_flush_truncate_log
>        ocfs2_replay_truncate_records
>          ocfs2_journal_access_di
>            __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.
> 
> 5. Then jbd2 flushes the bh of the truncate log to disk, but the bh of
>     the extent rec is still in an error state, and unfortunately nobody
>     will take care of it.
> 6. In the end the space of the extent rec is not reduced, but the
>     truncate log flush worker has already given it back to globalalloc.
>     That causes a duplicate cluster problem which can be identified by
>     fsck.ocfs2.
> 
> Sadly we can hardly revert this, so set the fs read-only to avoid
> ruining the atomicity and consistency of space reclaim.
> 
> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
> 
> Signed-off-by: Jun Piao <piao...@huawei.com>
> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
> ---
>   fs/ocfs2/journal.c | 49 ++++++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 38 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 3630443..4c5661c 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -666,23 +666,50 @@ static int __ocfs2_journal_access(handle_t *handle,
>       /* we can safely remove this assertion after testing. */
>       if (!buffer_uptodate(bh)) {
>               mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
> -             mlog(ML_ERROR, "b_blocknr=%llu\n",
> -                  (unsigned long long)bh->b_blocknr);
> +             mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
> +                  (unsigned long long)bh->b_blocknr, bh->b_state);
> 
>               lock_buffer(bh);
>               /*
> -              * A previous attempt to write this buffer head failed.
> -              * Nothing we can do but to retry the write and hope for
> -              * the best.
> +              * We should not reuse the dirty bh directly due to the
> +              * following situation:
> +              *
> +              * 1. When removing extent rec, we will dirty the bhs of
> +              *    extent rec and truncate log at the same time, and
> +              *    hand them over to jbd2.
> +              * 2. The bhs are submitted to jbd2 area successfully.
> +              * 3. The device's write-back thread tries to flush the
> +              *    bhs to disk but encounters a write error due to an
> +              *    abnormal storage link.
> +              * 4. After a while the storage link becomes normal again.
> +              *    The truncate log flush worker, triggered by the next
> +              *    space reclaim, finds the dirty bh of the truncate
> +              *    log, clears its 'BH_Write_EIO' flag and then sets it
> +              *    uptodate in __ocfs2_journal_access():
> +              *
> +              *    ocfs2_truncate_log_worker
> +              *      ocfs2_flush_truncate_log
> +              *        __ocfs2_flush_truncate_log
> +              *          ocfs2_replay_truncate_records
> +              *            ocfs2_journal_access_di
> +              *              __ocfs2_journal_access
> +              *
> +              * 5. Then jbd2 flushes the bh of the truncate log to disk,
> +              *    but the bh of the extent rec is still in an error
> +              *    state, and unfortunately nobody will take care of it.
> +              * 6. In the end the space of the extent rec is not reduced,
> +              *    but the truncate log flush worker has already given it
> +              *    back to globalalloc. That causes a duplicate cluster
> +              *    problem which can be identified by fsck.ocfs2.
> +              *
> +              * Sadly we can hardly revert this, so set the fs
> +              * read-only to avoid ruining the atomicity and
> +              * consistency of space reclaim.
>                */

That's a lot of comment here, and it seems to be the same as what's in the
changelog. Must we add it here as well?
Besides, the scenario the comment describes is not a common case, and I
think the issue can also cause other kinds of metadata inconsistency, so
such a specific comment will puzzle maintainers afterwards.
I suggest simplifying it as below, for your reference:

A previous transaction with a couple of buffer heads failed checkpointing,
so all of those buffer heads are marked with IO_ERROR.
In the current transaction, one buffer head is shared with that previous
transaction and still carries IO_ERROR.
We can't just clear IO_ERROR and continue, since the other buffer heads
have not been written to disk yet.
The above case will cause metadata inconsistency.
Just return -EROFS here and abort the journal to avoid damaging the file
system.
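
Roughly, the hunk could then read like below (just a sketch reusing the
code from your patch together with the condensed comment, nothing beyond
what your change already does):

lock_buffer(bh);
/*
 * A previous transaction with a couple of buffer heads failed
 * checkpointing, so all of those buffer heads are marked with
 * IO_ERROR. The current transaction shares one of them and we
 * can't just clear IO_ERROR and continue, since the other buffer
 * heads have not reached disk yet, which would leave the metadata
 * inconsistent. Return -EROFS and abort the journal instead.
 */
if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
        unlock_buffer(bh);
        return ocfs2_error(osb->sb,
                        "A previous attempt to write this buffer head failed\n");
}
unlock_buffer(bh);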

Thanks,
Changwei

>               if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
> -                     clear_buffer_write_io_error(bh);
> -                     set_buffer_uptodate(bh);
> -             }
> -
> -             if (!buffer_uptodate(bh)) {
>                       unlock_buffer(bh);
> -                     return -EIO;
> +                     return ocfs2_error(osb->sb, "A previous attempt to "
> +                                     "write this buffer head failed\n");
>               }
>               unlock_buffer(bh);
>       }
> 
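
By the way, once this case is hit the ocfs2_journal_access_*() callers will
get the ocfs2_error() status (-EROFS) instead of the old silent retry, so
they bail out before dirtying anything. Just for illustration, a minimal
sketch of the typical caller-side handling, assuming the common
mlog_errno/goto-bail style used across ocfs2 (not code copied from
fs/ocfs2/alloc.c):

        /* Sketch only: the truncate log path would now stop here with
         * -EROFS instead of going on with a bh whose checkpoint
         * partially failed.
         */
        status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode),
                                         tl_bh, OCFS2_JOURNAL_ACCESS_WRITE);
        if (status < 0) {
                mlog_errno(status);
                goto bail;
        }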

