Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
Srini, On 11-06-08 22:17, Srinivas Eeda wrote: I think I have seen this problem in ocfs2-1.2 and it was addressed by using a new state DLM_LOCK_RES_IN_USE. But we didn't merge into mainline as sunil suggested we need to look for a different approach http://oss.oracle.com/pipermail/ocfs2-devel/2010-June/006669.html http://www.mail-archive.com/ocfs2-devel@oss.oracle.com/msg05733.html Yes, this is the same problem. one comment inline ..snip +spin_unlock(tmpres-spinlock); +spin_unlock(dlm-spinlock); lockres could still get added to purgelist at this point and we could still have the same problem? I think, here we need some mechanism that marks the lockres is in use that would protect it from adding to the purgelist? Seems there shouldn't be the possibility that the lockres was moved to purge list again. We are in the case of a create lock, there should not be any outstanding convert/unlock request in progress concurrently. If there is one, the there must be lock(s) attached to the lockres. But if there are locks attached, the lockres can't be purged. Also for concurrent create lock, there is no problem either because there must be a lock attached to the lockres. I think your patch works too. But changing DLM_LOCK_RES_IN_USE to DLM_LOCK_RES_CREATING_LOCK can help understand better. I also thought about inflight_locks, but so far I failed to find correct places to drop it. thanks, wengang. if (res) dlm_lockres_put(res); res = tmpres; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
+ spin_unlock(tmpres-spinlock); + spin_unlock(dlm-spinlock); lockres could still get added to purgelist at this point and we could still have the same problem? I think, here we need some mechanism that marks the lockres is in use that would protect it from adding to the purgelist? Seems there shouldn't be the possibility that the lockres was moved to purge list again. I think you are right about that once removed from the purge list it will not be put back again on the purgelist in this case :) I still think that there is a hole here. Is it possible that the spinlocks are released here and right then dlm_thread was walking through dirty list and then it could put this on purge list for the first time? If it's already on purgelist, then we seem to be safe by removing it. But what is preventing the lockres getting onto to purge list right at this point. locks are added to lockres a little later but till then it is not safe?(I think). We are in the case of a create lock, there should not be any outstanding convert/unlock request in progress concurrently. If there is one, the there must be lock(s) attached to the lockres. But if there are locks attached, the lockres can't be purged. Also for concurrent create lock, there is no problem either because there must be a lock attached to the lockres. I think your patch works too. But changing DLM_LOCK_RES_IN_USE to DLM_LOCK_RES_CREATING_LOCK can help understand better. I am ok with the name :). The thing we should do here is once we found a lockres and moving ahead with create lock request, we should be sure it's protected before releasing the spinlocks. I also thought about inflight_locks, but so far I failed to find correct places to drop it. thanks, wengang. if (res) dlm_lockres_put(res); res = tmpres; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
Srini, Yes, you are right! Agree! Sunil, My privious patch which checks if lockres is unhashed can't fix the problem that can't find lockres in ast handler. So I vote for Srini's patch. what's your comment? thanks, wengang. On 11-06-09 00:43, Srinivas Eeda wrote: + spin_unlock(tmpres-spinlock); + spin_unlock(dlm-spinlock); lockres could still get added to purgelist at this point and we could still have the same problem? I think, here we need some mechanism that marks the lockres is in use that would protect it from adding to the purgelist? Seems there shouldn't be the possibility that the lockres was moved to purge list again. I think you are right about that once removed from the purge list it will not be put back again on the purgelist in this case :) I still think that there is a hole here. Is it possible that the spinlocks are released here and right then dlm_thread was walking through dirty list and then it could put this on purge list for the first time? If it's already on purgelist, then we seem to be safe by removing it. But what is preventing the lockres getting onto to purge list right at this point. locks are added to lockres a little later but till then it is not safe?(I think). We are in the case of a create lock, there should not be any outstanding convert/unlock request in progress concurrently. If there is one, the there must be lock(s) attached to the lockres. But if there are locks attached, the lockres can't be purged. Also for concurrent create lock, there is no problem either because there must be a lock attached to the lockres. I think your patch works too. But changing DLM_LOCK_RES_IN_USE to DLM_LOCK_RES_CREATING_LOCK can help understand better. I am ok with the name :). The thing we should do here is once we found a lockres and moving ahead with create lock request, we should be sure it's protected before releasing the spinlocks. I also thought about inflight_locks, but so far I failed to find correct places to drop it. thanks, wengang. if (res) dlm_lockres_put(res); res = tmpres; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
On 06/08/2011 03:04 AM, Wengang Wang wrote: When we are to purge a lockres, we check if it's unused. the check includes 1. no locks attached to the lockres. 2. not dirty. 3. not in recovery. 4. no interested nodes in refmap. If the the above four are satisfied, we are going to purge it(remove it from hash table). While, when a create lock is in progress especially when lockres is owned remotely(spinlocks are dropped when networking), the lockres can satisfy above four condition. If it's put to purge list, it can be purged out from hash table when we are still accessing on it(sending request to owner for example). That's not what we want. I have met such a problem (orabug 12405575). The lockres is in the purge list already(there is a delay for real purge work) and the create lock request comes. When we are sending network message to the owner in dlmlock_remote(), the owner crashed. So we get DLM_RECOVERING as return value and retries dlmlock_remote(). And before the owner crash, we have purged the lockres. So the lockres become stale(on lockres-onwer). Thus the code calls dlmlock_remote() infinitely. fix: we remove the lockres from purge list if it's there in dlm_get_lock_resource() which is called for only createlock case. So that the lockres can't be purged when we are in progress of createlock. Signed-off-by: Wengang Wangwen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c | 41 + 1 files changed, 29 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 11eefb8..511d43c 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -709,28 +709,27 @@ lookup: spin_lock(dlm-spinlock); tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash); if (tmpres) { - int dropping_ref = 0; - - spin_unlock(dlm-spinlock); - spin_lock(tmpres-spinlock); /* We wait for the other thread that is mastering the resource */ if (tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN) { + spin_unlock(dlm-spinlock); __dlm_wait_on_lockres(tmpres); BUG_ON(tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN); + spin_unlock(tmpres-spinlock); + dlm_lockres_put(tmpres); + tmpres = NULL; + goto lookup; } if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); - } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) - dropping_ref = 1; - spin_unlock(tmpres-spinlock); - - /* wait until done messaging the master, drop our ref to allow - * the lockres to be purged, start over. */ - if (dropping_ref) { - spin_lock(tmpres-spinlock); + } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { + /* + * wait until done messaging the master, drop our ref to + * allow the lockres to be purged, start over. + */ + spin_unlock(dlm-spinlock); __dlm_wait_on_lockres_flags(tmpres, DLM_LOCK_RES_DROPPING_REF); spin_unlock(tmpres-spinlock); dlm_lockres_put(tmpres); @@ -739,6 +738,24 @@ lookup: } mlog(0, found in hash!\n); + /* + * we are going to do a create-lock next. so remove the lockres + * from purge list to avoid the case that we will access on the + * lockres which is already purged out from hash table in + * dlm_run_purge_list() path. + * otherwise, we could run into a problem: + * the owner dead(recovery didn't take care of this lockres + * since it's not in hashtable), and the code keeps sending + * request to the dead node and getting DLM_RECOVERING and + * then retrying infinitely. + */ + if (!list_empty(tmpres-purge)) { + list_del_init(tmpres-purge); + dlm_lockres_put(tmpres); + } + + spin_unlock(tmpres-spinlock); + spin_unlock(dlm-spinlock); if (res) dlm_lockres_put(res); res = tmpres; In short, you are holding onto the dlm-spinlock a bit longer and forcibly removing the lockres from the purgelist. I have two problems with this patch. Firstly it ignores the fact that the resource can be added to the purgelist right after we drop the dlm-spinlock. There is nothing to protect against that. And
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
On 06/09/2011 06:01 PM, Wengang Wang wrote: Yes, you are right. There is such a problem that the lockres can be added to purge list after we drop dlm-spinlock. I am not sure if it's the more likely case since there is a 8 seconds delay between the the time the lockres is added to purge list and the time it is purged. I guess in normal case, the create- lock should already finished in the 8 seconds. I considered about the inflight_locks. But I didn't think it well so far. Because simply moving the lockres out from purge list is not good enough, I will take a good think about making use of inflight_locks. Secondly, we currently manipulate the purgelist in one function only. __dlm_calc_lockres_usage(). We should stick to that. If we make good use of inflight_locks, I think we can. BTW, how are you testing this? I tested with the steps which can cause the original problem to prove it works.(but without UT). Also I made a regression test with the existing ocfs2-test suites(the multiple-xxx with reducing interations). What's UT? I am looking for is a testcase that reliably reproduces the problem. Not necessarily that can be checked in. But something that triggers the issue. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel