Hi Jun,

On 2018/1/27 11:52, piaojun wrote:
> Hi Jan and Changwei,
>
> I will describe the scenario again as below:
>
> 1. The bhs of the truncate log and the extent rec are submitted to the
>    jbd2 area successfully.
> 2. Then the write-back thread of the device helps flush the bhs to
>    disk, but encounters a write error due to an abnormal storage link.
Now, your problem description makes sense. It seems you have withdrawn
the last version of your patch from the -mm tree. I will look at your
next version.

Thanks,
Changwei

> 3. After a while the storage link becomes normal. The truncate log
>    flush worker, triggered by the next space reclaim, finds the dirty
>    bh of the truncate log, clears its 'BH_Write_EIO' and then sets it
>    uptodate in __ocfs2_journal_access().
> 4. Then jbd2 will flush the bh of the truncate log to disk, but the bh
>    of the extent rec is still in an error state, and unfortunately
>    nobody will take care of it.
> 5. At last the space of the extent rec is not reduced, but the
>    truncate log flush worker has given it back to the global bitmap.
>    That will cause a duplicate cluster problem, which can be
>    identified by fsck.ocfs2.
>
> I suggest ocfs2 should handle this problem.
>
> thanks,
> Jun
>
> On 2018/1/26 10:03, Changwei Ge wrote:
>> On 2018/1/26 9:45, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/1/26 9:00, Changwei Ge wrote:
>>>> Hi Jun,
>>>> Good morning :-)
>>>>
>>>> On 2018/1/25 20:45, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>>> Hi Jun,
>>>>>>
>>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>>> Hi Changwei,
>>>>>>>
>>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>>> Hi Jun,
>>>>>>>>
>>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the
>>>>>>>>> following situation:
>>>>>>>>>
>>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>>> Quick questions:
>>>>>>>> Do you mean the current code puts modifying the extent records
>>>>>>>> belonging to a certain file and modifying the truncate log into
>>>>>>>> the same transaction?
>>>>>>>> If so, forgive me since I didn't figure it out. Could you point
>>>>>>>> it out in your sequence diagram below?
>>>>>>>>
>>>>>>>> Then I can understand the issue your changelog is describing
>>>>>>>> better.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Changwei
>>>>>>>>
>>>>>>> Yes, I mean they are in the same transaction, as below:
>>>>>>>
>>>>>>> ocfs2_remove_btree_range
>>>>>>>   ocfs2_remove_extent        // modify extent records
>>>>>>>   ocfs2_truncate_log_append  // modify truncate log
>>>>>>
>>>>>> If so, I think the transaction including the operations on the
>>>>>> extent and the truncate log won't be committed.
>>>>>> And the journal should already be aborted once the periodic
>>>>>> transaction commit thread has been woken.
>>>>>> So no metadata will be changed.
>>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on
>>>>>> the truncate log.
>>>>>> Are you sure this is the root cause of your problem?
>>>>>> It feels a little strange to me.
>>>>>>
>>>>>> Thanks,
>>>>>> Changwei
>>>>>>
>>>>> As you said, the transaction was not committed, but after a while
>>>>> the bh of the truncate log was committed in another transaction.
>>>>> I'm sure about the cause: after applying this patch, the duplicate
>>>>> cluster problem is gone. I have been testing it for a few months.
>>>>
>>>> I think we are talking about two jbd2 transactions, right?
>>> Yes, two transactions are involved.
>>>
>>>> One is for moving clusters from the extent to the truncate log;
>>>> let's name it T1.
>>>> Another is for reclaiming clusters from the truncate log and
>>>> returning them to the global bitmap; let's name it T2.
>>>>
>>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2
>>>> journal will be aborted, which means it can't work any more.
>>>> All following transaction starts and commits will fail.
>>>>
>>>> So how can T2 be committed while T1 fails?
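Changwei's question rests on jbd2's abort semantics: once a commit
fails and the journal is marked aborted, no later transaction can even
start. A minimal sketch of that behavior, assuming the in-kernel jbd2
API of that era (try_start_t2 is a hypothetical helper, not ocfs2
code):

#include <linux/jbd2.h>
#include <linux/err.h>

/*
 * Hypothetical illustration only: if T1's commit hit an I/O error and
 * jbd2_journal_abort() ran, JBD2_ABORT is set on the journal and T2
 * can never start, let alone commit.
 */
static int try_start_t2(journal_t *journal)
{
	handle_t *handle;

	if (is_journal_aborted(journal))
		return -EROFS;	/* journal already dead after T1 failed */

	handle = jbd2_journal_start(journal, 1);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* ... journal_access/dirty the truncate log buffers here ... */

	return jbd2_journal_stop(handle);
}

Jun's report below is that this abort never actually happened in his
testing, which is exactly the point of contention.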
>>> From my testing, jbd2 won't be aborted when it encounters an IO
>>> error, and I printed bh->b_state = 0x44828 = 0b1000100100000101000.
>>> That means the bh has been submitted but hit a write IO error, and
>>> is still in jbd2, according to 'bh_state_bits' and 'jbd_state_bits'.
>>
>> Um... Strange :(
>> T1 fails to be committed but the journal is still normal?
>> And T2 is even committed successfully?
>>
>> I have added Jan to our discussion.
>> Hopefully, he can help clear our doubts. :)
>>
>>>> Otherwise, did you ever try to recover the jbd2 journal? If so, I
>>>> think your patch here is not fit for mainline yet.
>>>>
>>> Currently this problem needs the user to umount and mount again to
>>> recover, and I'm glad to hear your advice.
>>>
>> I think it's better to do so for now.
>> Moreover, ocfs2 will fence the problematic node out.
>>
>> Thanks,
>> Changwei
>>
>>> thanks,
>>> Jun
>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>>> thanks,
>>>>> Jun
>>>>>
>>>>>>> thanks,
>>>>>>> Jun
>>>>>>>
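Jun's b_state value can be decoded against enum bh_state_bits
(include/linux/buffer_head.h) and enum jbd_state_bits
(include/linux/jbd2.h). A small userspace sketch; the bit positions are
hand-copied from kernels of that era and should be re-checked against
the exact tree:

#include <stdio.h>

/* Bit positions hand-copied from include/linux/buffer_head.h and
 * include/linux/jbd2.h (~4.x kernels); verify against your tree. */
static const struct { int bit; const char *name; } bh_bits[] = {
	{ 0, "BH_Uptodate" }, { 1, "BH_Dirty" },  { 2, "BH_Lock" },
	{ 3, "BH_Req" },      { 5, "BH_Mapped" }, { 11, "BH_Write_EIO" },
	{ 14, "BH_Meta" },    { 17, "BH_JBD" },   { 18, "BH_JWrite" },
	{ 22, "BH_JBDDirty" },
};

int main(void)
{
	unsigned long b_state = 0x44828;	/* the value Jun printed */
	size_t i;

	for (i = 0; i < sizeof(bh_bits) / sizeof(bh_bits[0]); i++)
		if (b_state & (1UL << bh_bits[i].bit))
			printf("%s\n", bh_bits[i].name);
	return 0;
}

With that layout, 0x44828 decodes to BH_Req | BH_Mapped | BH_Write_EIO
| BH_Meta | BH_JWrite: submitted, mapped, failed with a write error,
and still held by the jbd2 write path, which matches Jun's reading.

The remainder of the original patch under discussion is quoted below.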
>>>>>>>>> truncate log at the same time, and hand them over to jbd2.
>>>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>>>> 3. After a while the storage link becomes normal. Truncate log
>>>>>>>>>    flush worker triggered by the next space reclaiming finds
>>>>>>>>>    the dirty bh of truncate log and clears its 'BH_Write_EIO'
>>>>>>>>>    and then sets it uptodate in __ocfs2_journal_access():
>>>>>>>>>
>>>>>>>>>    ocfs2_truncate_log_worker
>>>>>>>>>      ocfs2_flush_truncate_log
>>>>>>>>>        __ocfs2_flush_truncate_log
>>>>>>>>>          ocfs2_replay_truncate_records
>>>>>>>>>            ocfs2_journal_access_di
>>>>>>>>>              __ocfs2_journal_access  // here we clear io_error
>>>>>>>>>                                      // and set 'tl_bh' uptodate
>>>>>>>>>
>>>>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the
>>>>>>>>>    bh of extent rec is still in error state, and unfortunately
>>>>>>>>>    nobody will take care of it.
>>>>>>>>> 5. At last the space of extent rec was not reduced, but
>>>>>>>>>    truncate log flush worker has given it back to the global
>>>>>>>>>    bitmap. That will cause a duplicate cluster problem which
>>>>>>>>>    could be identified by fsck.ocfs2.
>>>>>>>>>
>>>>>>>>> So we should return -EIO to avoid ruining the atomicity and
>>>>>>>>> consistency of space reclaim.
>>>>>>>>>
>>>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>>>>>>> ---
>>>>>>>>>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>>>> index 3630443..d769ca2 100644
>>>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>>>  	/* we can safely remove this assertion after testing. */
>>>>>>>>>  	if (!buffer_uptodate(bh)) {
>>>>>>>>>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>>>
>>>>>>>>>  		lock_buffer(bh);
>>>>>>>>>  		/*
>>>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>>>> -		 * the best.
>>>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>>>> +		 * following situation:
>>>>>>>>> +		 *
>>>>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>>>>> +		 *    hand them over to jbd2.
>>>>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>>>>> +		 *    storage link.
>>>>>>>>> +		 * 3. After a while the storage link becomes normal.
>>>>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>>>>> +		 *    space reclaiming finds the dirty bh of truncate log
>>>>>>>>> +		 *    and clears its 'BH_Write_EIO' and then sets it
>>>>>>>>> +		 *    uptodate in __ocfs2_journal_access():
>>>>>>>>> +		 *
>>>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>>>> +		 *
>>>>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to
>>>>>>>>> +		 *    disk, but the bh of extent rec is still in error
>>>>>>>>> +		 *    state, and unfortunately nobody will take care of
>>>>>>>>> +		 *    it.
>>>>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>>>>> +		 *    but truncate log flush worker has given it back to
>>>>>>>>> +		 *    the global bitmap. That will cause a duplicate
>>>>>>>>> +		 *    cluster problem which could be identified by
>>>>>>>>> +		 *    fsck.ocfs2.
>>>>>>>>> +		 *
>>>>>>>>> +		 * So we should return -EIO to avoid ruining the
>>>>>>>>> +		 * atomicity and consistency of space reclaim.
>>>>>>>>>  		 */
>>>>>>>>>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>>>> -		}
>>>>>>>>> -
>>>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>>>> +					"buffer head failed\n");
>>>>>>>>>  			unlock_buffer(bh);
>>>>>>>>>  			return -EIO;
>>>>>>>>>  		}

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
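With the patch applied, __ocfs2_journal_access() no longer clears
BH_Write_EIO, so the -EIO propagates back up the call chain quoted in
the changelog and the truncate log replay stops before any clusters go
back to the global bitmap. A rough caller-side sketch under simplified,
hypothetical signatures (the real code lives in fs/ocfs2/alloc.c and
takes more arguments):

/* Sketch only; in-tree this would live in fs/ocfs2 and include
 * "ocfs2.h", "journal.h" and "inode.h". */
static int replay_truncate_record(handle_t *handle,
				  struct inode *tl_inode,
				  struct buffer_head *tl_bh)
{
	int status;

	/* With the patch, this returns -EIO when tl_bh carries
	 * BH_Write_EIO, instead of silently marking it uptodate. */
	status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode),
					 tl_bh, OCFS2_JOURNAL_ACCESS_WRITE);
	if (status < 0)
		return status;	/* bail out before freeing any clusters */

	/* ... only now is it safe to hand clusters back ... */
	return 0;
}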