Hi Junxiao,

On 2016/1/13 15:00, Junxiao Bi wrote:
> On 01/13/2016 02:21 PM, xuejiufei wrote:
>> Hi Junxiao,
>> I have not described the issue clearly.
>>
>> Node 1                          Node 2(master)
>> dlmlock
>> dlm_do_master_request
>>                                 dlm_master_request_handler
>>                                 -> dlm_lockres_set_refmap_bit
>> dlmlock succeed
>> dlmunlock succeed
>>
>> dlm_purge_lockres
>>                                 dlm_deref_handler
>>                                 -> find lock resource is in
>>                                    DLM_LOCK_RES_SETREF_INPROG state,
>>                                    so dispatch a deref work
>> dlm_purge_lockres succeed.
>>
>> call dlmlock again
>> dlm_do_master_request
>>                                 dlm_master_request_handler
>>                                 -> dlm_lockres_set_refmap_bit
>>
>>                                 deref work trigger, call
>>                                 dlm_lockres_clear_refmap_bit
>>                                 to clear Node 1 from refmap
>>
>>                                 dlm_purge_lockres succeed
>>
>> dlm_send_remote_lock_request
>>                                 return DLM_IVLOCKID because
>>                                 the lockres does not exist
> More clear now. Thank you.
> This is a very complicated race. I don't have a good solution for it
> yet. Your fix looks like it works, but I am afraid that if we keep fixing
> this kind of race case by case, we will make dlm harder to understand and
> more prone to bugs. Maybe we should think about refactoring dlm.
>
Agree. IMO, the root cause is that a bit operation cannot handle such a
case. I wonder if we have to change it to a refcount, which may require a
much bigger refactoring.
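To make the bit-versus-refcount point concrete, here is a minimal user-space
sketch in C. It is not kernel code and not a proposed patch. It replays the
interleaving from the diagram above against a toy model of the master's
bookkeeping for Node 1 only. All toy_* names are hypothetical and do not exist
in o2dlm; the counter merely illustrates the refcount idea under the assumption
that every set is eventually matched by exactly one clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of what the master records about ONE remote node (Node 1).
 * Hypothetical illustration only; these are not o2dlm structures. */
struct toy_master {
    bool refmap_bit; /* models Node 1's bit in the lockres refmap */
    int  refcount;   /* hypothetical counter-based alternative */
};

/* Models dlm_lockres_set_refmap_bit: setting an already-set bit is a no-op,
 * while the counter records every reference separately. */
static void toy_set_ref(struct toy_master *m)
{
    m->refmap_bit = true;
    m->refcount++;
}

/* Models dlm_lockres_clear_refmap_bit: the bit forgets ALL references,
 * while the counter drops exactly one. */
static void toy_clear_ref(struct toy_master *m)
{
    assert(m->refcount > 0);
    m->refmap_bit = false;
    m->refcount--;
}

/* Replays the racy ordering seen on the master and returns the final state. */
static struct toy_master toy_replay_race(void)
{
    struct toy_master m = { false, 0 };

    toy_set_ref(&m);   /* first master request from Node 1 */
    /* Node 1 unlocks and purges; the deref is deferred to a work item
     * because the lockres is in DLM_LOCK_RES_SETREF_INPROG state. */
    toy_set_ref(&m);   /* second master request arrives first */
    toy_clear_ref(&m); /* deferred deref work finally runs */
    return m;
}

/* The master may purge only when it believes no node references the lockres. */
static bool toy_bit_allows_purge(const struct toy_master *m)
{
    return !m->refmap_bit;
}

static bool toy_count_allows_purge(const struct toy_master *m)
{
    return m->refcount == 0;
}
```

After toy_replay_race(), toy_bit_allows_purge() is true while
toy_count_allows_purge() is false: the late clear erases the second reference
in the bit model, but the counter still shows one outstanding reference from
Node 1.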

Thanks,
Joseph

> Thanks,
> Junxiao.
>
>> BUG if the lockres is $RECOVERY
>>
>> On 2016/1/13 10:46, Junxiao Bi wrote:
>>> On 01/12/2016 03:16 PM, xuejiufei wrote:
>>>> Hi, Junxiao
>>>>
>>>> On 2016/1/12 12:03, Junxiao Bi wrote:
>>>>> Hi Jiufei,
>>>>>
>>>>> On 01/11/2016 10:46 AM, xuejiufei wrote:
>>>>>> Hi all,
>>>>>> We have found a race between refmap setting and clearing which
>>>>>> can cause the lock resource on the master to be freed before other
>>>>>> nodes purge it.
>>>>>>
>>>>>> Node 1                          Node 2(master)
>>>>>> dlm_do_master_request
>>>>>>                                 dlm_master_request_handler
>>>>>>                                 -> dlm_lockres_set_refmap_bit
>>>>>> call dlm_purge_lockres after unlock
>>>>>>                                 dlm_deref_handler
>>>>>>                                 -> find lock resource is in
>>>>>>                                    DLM_LOCK_RES_SETREF_INPROG state,
>>>>>>                                    so dispatch a deref work
>>>>>> dlm_purge_lockres succeed.
>>>>>>
>>>>>> dlm_do_master_request
>>>>>>                                 dlm_master_request_handler
>>>>>>                                 -> dlm_lockres_set_refmap_bit
>>>>>>
>>>>>>                                 deref work trigger, call
>>>>>>                                 dlm_lockres_clear_refmap_bit
>>>>>>                                 to clear Node 1 from refmap
>>>>>>
>>>>>> Now Node 2 can purge the lock resource, but the owner of the lock
>>>>>> resource on Node 1 is still Node 2, which may trigger a BUG if the
>>>>>> lock resource is $RECOVERY, or cause other problems.
>>>>>>
>>>>>> We have discussed 2 solutions:
>>>>>> 1) The master returns an error to Node 1 if DLM_LOCK_RES_SETREF_INPROG
>>>>>> is set. Node 1 will not retry, and the master sends another message to
>>>>>> Node 1 after clearing the refmap. Node 1 can purge the lock resource
>>>>>> after the refmap on the master is cleared.
>>>>>> 2) The master returns an error to Node 1 if DLM_LOCK_RES_SETREF_INPROG
>>>>>> is set, and Node 1 will retry to deref the lockres.
>>>>>>
>>>>>> Does anybody have better ideas?
>>>>>>
>>>>> dlm_purge_lockres() will wait to drop ref until
>>>>> DLM_LOCK_RES_SETREF_INPROG is cleared. So if set this flag when find the
>>>>> master during doing master request.
>>>>> And then this flag was cleared when
>>>>> receiving assert master message, can this fix the issue?
>>>>>
>>>> I don't think this can fix it. Before doing the master request, the lock
>>>> resource is already purged. The master should clear the refmap before
>>>> the client purges it.
>>> inflight_locks is increased in dlm_get_lock_resource(), which will stop
>>> the lockres from being purged? Set DLM_LOCK_RES_SETREF_INPROG when the
>>> lockres owner is found during the master request, then this will stop the
>>> lockres from being purged after unlock?
>>>
>>> Thanks,
>>> Junxiao.
>>>
>>>>
>>>> Thanks,
>>>> Jiufei
>>>>
>>>>> Thanks,
>>>>> Junxiao.
>>>>>> Thanks,
>>>>>> --Jiufei
>>>>>>
_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel