Hi Changwei,

Thanks for the quick reply. Gang and I are thinking about how to notify
users so they can recover from this problem, and I will post patch v2
later. I have also attached, below the quoted thread, a small sketch
that decodes the b_state value from our earlier discussion.
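To make the discussion below easier to follow, here is a condensed
sketch of the pre-patch logic in __ocfs2_journal_access() in
fs/ocfs2/journal.c (reconstructed from the hunk quoted at the bottom of
this thread, with extra comments added); the clear/set pair is what
silently discards the write error and what the patch replaces with an
-EIO return:

	/*
	 * Condensed sketch of __ocfs2_journal_access() before this
	 * patch. A bh whose last write failed carries BH_Write_EIO
	 * and is no longer uptodate.
	 */
	if (!buffer_uptodate(bh)) {
		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");

		lock_buffer(bh);
		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
			/*
			 * Pre-patch behaviour: forget the failed write
			 * and mark the stale bh uptodate again, so jbd2
			 * reuses the truncate log bh while the extent
			 * rec bh from the same failed transaction is
			 * left in the error state.
			 */
			clear_buffer_write_io_error(bh);
			set_buffer_uptodate(bh);
		}

		if (!buffer_uptodate(bh)) {
			unlock_buffer(bh);
			return -EIO;
		}
		unlock_buffer(bh);
	}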
thanks,
Jun

On 2018/1/27 13:17, Changwei Ge wrote:
> Hi Jun,
>
> On 2018/1/27 11:52, piaojun wrote:
>> Hi Jan and Changwei,
>>
>> I will describe the scenario again as below:
>>
>> 1. The bhs of the truncate log and the extent rec are submitted to
>>    the jbd2 area successfully.
>> 2. Then the write-back thread of the device helps flush the bhs to
>>    disk but encounters a write error due to an abnormal storage link.
>
> Now, your problem description makes sense.
> It seems you have withdrawn the last version of your patch from the
> -mm tree. I will look at your next version.
>
> Thanks,
> Changwei
>
>> 3. After a while the storage link becomes normal. The truncate log
>>    flush worker, triggered by the next space reclaim, finds the dirty
>>    bh of the truncate log, clears its 'BH_Write_EIO' and then sets it
>>    uptodate in __ocfs2_journal_access().
>> 4. Then jbd2 flushes the bh of the truncate log to disk, but the bh
>>    of the extent rec is still in the error state, and unfortunately
>>    nobody will take care of it.
>> 5. In the end the space of the extent rec is not reduced, but the
>>    truncate log flush worker has given it back to globalalloc. That
>>    causes a duplicate-cluster problem which can be identified by
>>    fsck.ocfs2.
>>
>> I suggest ocfs2 should handle this problem.
>>
>> thanks,
>> Jun
>>
>> On 2018/1/26 10:03, Changwei Ge wrote:
>>> On 2018/1/26 9:45, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/26 9:00, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>> Good morning :-)
>>>>>
>>>>> On 2018/1/25 20:45, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>>>> Hi Changwei,
>>>>>>>>
>>>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>>>> Hi Jun,
>>>>>>>>>
>>>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the
>>>>>>>>>> following situation:
>>>>>>>>>>
>>>>>>>>>> 1. When removing an extent rec, we will dirty the bhs of the
>>>>>>>>>>    extent rec and
>>>>>>>>> Quick questions:
>>>>>>>>> Do you mean the current code puts modifying the extent records
>>>>>>>>> belonging to a certain file and modifying the truncate log into
>>>>>>>>> the same transaction?
>>>>>>>>> If so, forgive me since I didn't figure it out. Could you point
>>>>>>>>> it out in your following sequence diagram?
>>>>>>>>>
>>>>>>>>> Afterwards, I can understand the issue your change log is
>>>>>>>>> describing better.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Changwei
>>>>>>>>>
>>>>>>>> Yes, I mean they are in the same transaction, as below:
>>>>>>>>
>>>>>>>> ocfs2_remove_btree_range
>>>>>>>>   ocfs2_remove_extent        // modify extent records
>>>>>>>>   ocfs2_truncate_log_append  // modify truncate log
>>>>>>>
>>>>>>> If so, I think the transaction including the operations on the
>>>>>>> extent and the truncate log won't be committed.
>>>>>>> And the journal should already be aborted if the interval
>>>>>>> transaction commit thread has been woken.
>>>>>>> So no metadata will be changed.
>>>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on
>>>>>>> the truncate log.
>>>>>>> Are you sure this is the root cause of your problem?
>>>>>>> It feels a little strange to me.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Changwei
>>>>>>>
>>>>>> As you said, the transaction was not committed, but after a while
>>>>>> the bh of the truncate log was committed in another transaction.
>>>>>> I'm sure about the cause, and after applying this patch the
>>>>>> duplicate cluster problem is gone. I have tested it for a few
>>>>>> months.
>>>>>
>>>>> I think we are talking about two jbd2 transactions, right?
>>>> Yes, two transactions are involved.
>>>>
>>>>> One is for moving clusters from the extent to the truncate log.
>>>>> Let's name it T1.
>>>>> Another is for reclaiming clusters from the truncate log and
>>>>> returning them to the global bitmap. Let's name it T2.
>>>>>
>>>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2
>>>>> journal will be aborted, which means it can't work any more.
>>>>> All subsequent transaction starts and commits will fail.
>>>>>
>>>>> So, how can T2 be committed while T1 fails?
>>>> From my testing, jbd2 won't be aborted when it encounters an IO
>>>> error, and I printed bh->b_state = 0x44828 = 1000100100000101000.
>>>> That means the bh has been submitted but hit a write IO error, and
>>>> is still in jbd2, according to 'bh_state_bits' and 'jbd_state_bits'.
>>>
>>> Um... Strange :(
>>> T1 fails to be committed but the journal is still normal?
>>> And T2 is even committed successfully?
>>>
>>> I have added Jan to our discussion.
>>> Hopefully, he can help clear up our doubts. :)
>>>
>>>>
>>>>>
>>>>> Otherwise, did you ever try to recover the jbd2 journal? If so, I
>>>>> think your patch here is not fit for mainline yet.
>>>>>
>>>> Currently this problem requires the user to umount and mount again
>>>> to recover, and I'm glad to hear your advice.
>>>>
>>>
>>> I think it's better to do so for now.
>>> Moreover, ocfs2 will fence the problematic node out.
>>>
>>> Thanks,
>>> Changwei
>>>
>>>> thanks,
>>>> Jun
>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>
>>>>>>
>>>>>> thanks,
>>>>>> Jun
>>>>>>
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> Jun
>>>>>>>>
>>>>>>>>>>    truncate log at the same time, and hand them over to jbd2.
>>>>>>>>>> 2. The bhs are not flushed to disk due to an abnormal storage
>>>>>>>>>>    link.
>>>>>>>>>> 3. After a while the storage link becomes normal. The truncate
>>>>>>>>>>    log flush worker, triggered by the next space reclaim,
>>>>>>>>>>    finds the dirty bh of the truncate log, clears its
>>>>>>>>>>    'BH_Write_EIO' and then sets it uptodate in
>>>>>>>>>>    __ocfs2_journal_access():
>>>>>>>>>>
>>>>>>>>>> ocfs2_truncate_log_worker
>>>>>>>>>>   ocfs2_flush_truncate_log
>>>>>>>>>>     __ocfs2_flush_truncate_log
>>>>>>>>>>       ocfs2_replay_truncate_records
>>>>>>>>>>         ocfs2_journal_access_di
>>>>>>>>>>           __ocfs2_journal_access // here we clear io_error and
>>>>>>>>>>                                  // set 'tl_bh' uptodate
>>>>>>>>>>
>>>>>>>>>> 4. Then jbd2 will flush the bh of the truncate log to disk,
>>>>>>>>>>    but the bh of the extent rec is still in the error state,
>>>>>>>>>>    and unfortunately nobody will take care of it.
>>>>>>>>>> 5. In the end the space of the extent rec is not reduced, but
>>>>>>>>>>    the truncate log flush worker has given it back to
>>>>>>>>>>    globalalloc. That will cause a duplicate-cluster problem
>>>>>>>>>>    which can be identified by fsck.ocfs2.
>>>>>>>>>>
>>>>>>>>>> So we should return -EIO to avoid ruining the atomicity and
>>>>>>>>>> consistency of space reclaim.
>>>>>>>>>>
>>>>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>>>>>>>> ---
>>>>>>>>>>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>>>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>>>>> index 3630443..d769ca2 100644
>>>>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>>>>  	/* we can safely remove this assertion after testing. */
>>>>>>>>>>  	if (!buffer_uptodate(bh)) {
>>>>>>>>>>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>>>>
>>>>>>>>>>  		lock_buffer(bh);
>>>>>>>>>>  		/*
>>>>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>>>>> -		 * the best.
>>>>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>>>>> +		 * following situation:
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * 1. When removing an extent rec, we dirty the bhs of
>>>>>>>>>> +		 *    the extent rec and truncate log at the same time,
>>>>>>>>>> +		 *    and hand them over to jbd2.
>>>>>>>>>> +		 * 2. The bhs are not flushed to disk due to an abnormal
>>>>>>>>>> +		 *    storage link.
>>>>>>>>>> +		 * 3. After a while the storage link becomes normal. The
>>>>>>>>>> +		 *    truncate log flush worker triggered by the next
>>>>>>>>>> +		 *    space reclaim finds the dirty bh of the truncate
>>>>>>>>>> +		 *    log, clears its 'BH_Write_EIO' and then sets it
>>>>>>>>>> +		 *    uptodate in __ocfs2_journal_access():
>>>>>>>>>> +		 *
>>>>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * 4. Then jbd2 flushes the bh of the truncate log to
>>>>>>>>>> +		 *    disk, but the bh of the extent rec is still in the
>>>>>>>>>> +		 *    error state, and unfortunately nobody will take
>>>>>>>>>> +		 *    care of it.
>>>>>>>>>> +		 * 5. In the end the space of the extent rec is not
>>>>>>>>>> +		 *    reduced, but the truncate log flush worker has
>>>>>>>>>> +		 *    given it back to globalalloc. That causes a
>>>>>>>>>> +		 *    duplicate-cluster problem which can be identified
>>>>>>>>>> +		 *    by fsck.ocfs2.
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * So we should return -EIO to avoid ruining the
>>>>>>>>>> +		 * atomicity and consistency of space reclaim.
>>>>>>>>>>  		 */
>>>>>>>>>>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>>>>> -		}
>>>>>>>>>> -
>>>>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this buffer head failed\n");
>>>>>>>>>>  			unlock_buffer(bh);
>>>>>>>>>>  			return -EIO;
>>>>>>>>>>  		}
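P.S. For anyone who wants to check the bh->b_state = 0x44828 decoding
discussed above, here is a small user-space sketch. The bit names are
copied from one reading of enum bh_state_bits in
include/linux/buffer_head.h; the positions are not ABI and shift
between kernel versions, so treat them as illustrative rather than
authoritative:

	#include <stdio.h>

	/*
	 * Illustrative bit names only. Bits at or above
	 * BH_PrivateStart belong to the owner of the bh (here jbd2;
	 * see enum jbd_state_bits in include/linux/jbd2.h).
	 */
	static const char *bh_bits[] = {
		"BH_Uptodate",      "BH_Dirty",  "BH_Lock",     "BH_Req",
		"BH_Uptodate_Lock", "BH_Mapped", "BH_New",      "BH_Async_Read",
		"BH_Async_Write",   "BH_Delay",  "BH_Boundary", "BH_Write_EIO",
		"BH_Unwritten",     "BH_Quiet",  "BH_Meta",     "BH_Prio",
	};

	int main(void)
	{
		unsigned long state = 0x44828; /* value printed in the thread */
		unsigned int n = sizeof(bh_bits) / sizeof(bh_bits[0]);
		unsigned int i;

		for (i = 0; i < 8 * sizeof(state); i++) {
			if (!(state & (1UL << i)))
				continue;
			if (i < n)
				printf("bit %2u: %s\n", i, bh_bits[i]);
			else
				printf("bit %2u: private (jbd2) state\n", i);
		}
		return 0;
	}

On that reading, 0x44828 sets BH_Req, BH_Mapped, BH_Write_EIO, BH_Meta
and one bit in the jbd2 private range, which matches the interpretation
above: the bh was submitted, the write failed, and jbd2 still owns it.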