On 2018/3/2 10:09, piaojun wrote:
> Hi Changwei,
>
> On 2018/3/2 9:49, Changwei Ge wrote:
>> Hi Jun,
>> I still have some doubts about your problematic situation description.
>> Please check my replies inline in your sequence diagram.
>>
>> On 2018/3/1 20:38, piaojun wrote:
>>> Hi Changwei,
>>>
>>> Thanks for your quick reply, please see my comments below.
>>>
>>> On 2018/3/1 17:39, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> On 2018/3/1 17:27, piaojun wrote:
>>>>> We should not handle migrate lockres if we are already in
>>>>> 'DLM_CTXT_IN_SHUTDOWN', as that will cause the lockres to remain after
>>>>> leaving the dlm domain. In the end, other nodes will get stuck in an
>>>>> infinite loop when requesting the lock from us.
>>>>>
>>>>> N1                                    N2 (owner)
>>>>>                                       touch file
>>>>>
>>>>> access the file,
>>>>> and get pr lock
>>>>>
>>>>> umount
>>>>>
>>>>
>>>> Before migrating all lock resources, N1 should have already sent
>>>> DLM_BEGIN_EXIT_DOMAIN_MSG in dlm_begin_exit_domain().
>>>> N2 will set ->exit_domain_map later.
>>>> So N2 can't take N1 as migration target.
>>> Before receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as
>>> the migrate target. So N2 will continue sending the lockres to N1 even
>>> though N1 has left the domain. Sorry for the confusion; I will give a
>>> more detailed description.
>>>
>>> N1                                    N2 (owner)
>>>                                       touch file
>>>
>>> access the file,
>>> and get pr lock
>>>
>>>                                       begin leave domain and
>>>                                       pick up N1 as new owner
>>>
>>> begin leave domain and
>> Here N1 will be cleared from N2's dlm->domain_map.
>>
>>> migrate all lockres done
>
>>> begin migrate lockres to N1
>> As N1 has been cleared, migration of this lock resource can't continue.
> N2 has picked up N1 as new owner in dlm_pick_migration_target() before
> setting N1 in N2's dlm->exit_domain_map, so N2 will continue migrating
> the lock resource.
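To make the suspected interleaving concrete, here is a minimal userspace model of the guard this patch is after (plain C, not ocfs2 code; every identifier below is invented for illustration): the receive side re-checks the domain state under the same lock the shutdown path takes, and refuses an in-flight migration once shutdown has begun.

	/*
	 * Minimal model of the guard discussed in this thread; NOT ocfs2 code,
	 * all names are made up.  The point: the handler that receives a
	 * migrated lock resource re-checks the domain state under the same
	 * lock that the shutdown path takes, so a migration that arrives
	 * after shutdown has begun is refused instead of leaving a stray
	 * lockres behind.
	 */
	#include <pthread.h>
	#include <stdio.h>

	enum domain_state { CTXT_JOINED, CTXT_IN_SHUTDOWN };

	static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
	static enum domain_state state = CTXT_JOINED;

	/* stands in for dlm_mig_lockres_handler() on the migration target (N1) */
	static int handle_migrated_lockres(const char *lockname)
	{
		int ret = 0;

		pthread_mutex_lock(&domain_lock);
		if (state != CTXT_JOINED) {
			printf("reject %s: domain already shutting down\n", lockname);
			ret = -1;	/* sender has to pick another owner */
		} else {
			printf("accept %s: this node becomes the new owner\n", lockname);
		}
		pthread_mutex_unlock(&domain_lock);
		return ret;
	}

	/* stands in for the umount path flipping the domain into shutdown */
	static void begin_exit_domain(void)
	{
		pthread_mutex_lock(&domain_lock);
		state = CTXT_IN_SHUTDOWN;
		pthread_mutex_unlock(&domain_lock);
	}

	int main(void)
	{
		/* the interleaving from the diagram: N1 starts leaving first ... */
		begin_exit_domain();
		/* ... and only then does N2's migrate-lockres message arrive */
		handle_migrated_lockres("example_lockres");
		return 0;
	}

It only models the ordering from the diagram above (shutdown begins first, the migrate message lands afterwards) and builds with "cc -pthread model.c", where model.c is just a placeholder file name.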
Do you mean that DLM_BEGIN_EXIT_DOMAIN_MSG arrives just before
dlm_send_one_lockres() in dlm_migrate_lockres() and after
dlm_mark_lockres_migrating()? If so, there is a race in which your problem
can happen. Well, let's move on to your code.

>>
>> Besides,
>> I wonder what kind of problem you encountered?
>> A bug crash? A hang situation?
>> It's better to share how the issue appears, so perhaps I can help to
>> analyze. :)
> I encountered a stuck case which is described in my changelog.
>
> "
> In the end, other nodes will get stuck in an infinite loop when requesting
> the lock from us.
> "
>
> thanks,
> Jun
>>
>> Thanks,
>> Changwei
>>
>>>
>>> end leave domain, but
>>> the lockres left
>>> unexpectedly, because
>>> migrate task has passed
>>>
>>> thanks,
>>> Jun
>>>>
>>>> How can the scenario your changelog describes happen?
>>>> Or am I missing something?
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>>> migrate all lockres
>>>>>
>>>>>                                       umount and migrate lockres to N1
>>>>>
>>>>> leave dlm domain, but
>>>>> the lockres left
>>>>> unexpectedly, because
>>>>> migrate task has passed
>>>>>
>>>>> Signed-off-by: Jun Piao <piao...@huawei.com>
>>>>> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com>
>>>>> ---
>>>>>  fs/ocfs2/dlm/dlmdomain.c   | 14 ++++++++++++++
>>>>>  fs/ocfs2/dlm/dlmdomain.h   |  1 +
>>>>>  fs/ocfs2/dlm/dlmrecovery.c |  9 +++++++++
>>>>>  3 files changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>>> index e1fea14..3b7ec51 100644
>>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>>> @@ -675,6 +675,20 @@ static void dlm_leave_domain(struct dlm_ctxt *dlm)
>>>>>  	spin_unlock(&dlm->spinlock);
>>>>>  }
>>>>>
>>>>> +int dlm_joined(struct dlm_ctxt *dlm)
>>>>> +{
>>>>> +	int ret = 0;
>>>>> +
>>>>> +	spin_lock(&dlm_domain_lock);
>>>>> +
>>>>> +	if (dlm->dlm_state == DLM_CTXT_JOINED)
>>>>> +		ret = 1;
>>>>> +
>>>>> +	spin_unlock(&dlm_domain_lock);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +

How about directly using the existing function dlm_shutting_down()?

>>>>>  int dlm_shutting_down(struct dlm_ctxt *dlm)
>>>>>  {
>>>>>  	int ret = 0;
>>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.h b/fs/ocfs2/dlm/dlmdomain.h
>>>>> index fd6122a..2f7f60b 100644
>>>>> --- a/fs/ocfs2/dlm/dlmdomain.h
>>>>> +++ b/fs/ocfs2/dlm/dlmdomain.h
>>>>> @@ -28,6 +28,7 @@
>>>>>  extern spinlock_t dlm_domain_lock;
>>>>>  extern struct list_head dlm_domains;
>>>>>
>>>>> +int dlm_joined(struct dlm_ctxt *dlm);
>>>>>  int dlm_shutting_down(struct dlm_ctxt *dlm);
>>>>>  void dlm_fire_domain_eviction_callbacks(struct dlm_ctxt *dlm,
>>>>>  					int node_num);
>>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> index ec8f758..9b3bc66 100644
>>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> @@ -1378,6 +1378,15 @@ int dlm_mig_lockres_handler(struct o2net_msg *msg, u32 len, void *data,
>>>>>  	if (!dlm_grab(dlm))
>>>>>  		return -EINVAL;
>>>>>
>>>>> +	if (!dlm_joined(dlm)) {

This could be if (dlm_shutting_down(dlm))?

>>>>> +		mlog(ML_ERROR, "Domain %s not joined! "

The log message is not accurate. "Domain is shutting down" would be better,
I suppose.

>>>>> +			  "lockres %.*s, master %u\n",
>>>>> +			  dlm->name, mres->lockname_len,

I don't like the alignment style. How about:

	mlog(ML_ERROR, "Domain %s not joined! "
" "lockres %.*s, master %u\n", Thanks, Changwei >>>>> + mres->lockname, mres->master); >>>>> + dlm_put(dlm); >>>>> + return -EINVAL; >>>>> + } >>>>> + >>>>> BUG_ON(!(mres->flags & (DLM_MRES_RECOVERY|DLM_MRES_MIGRATION))); >>>>> >>>>> real_master = mres->master; >>>>> >>>> . >>>> >>> >> . >> > _______________________________________________ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel