Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Wengang Wang
Srini,

On 11-06-08 22:17, Srinivas Eeda wrote:
 I think I have seen this problem in ocfs2-1.2 and it was addressed
 by using a new state DLM_LOCK_RES_IN_USE. But we didn't merge into
 mainline as sunil suggested we need to look for a different approach
 
 http://oss.oracle.com/pipermail/ocfs2-devel/2010-June/006669.html
 http://www.mail-archive.com/ocfs2-devel@oss.oracle.com/msg05733.html

Yes, this is the same problem.
 
 one comment inline
..snip
 +spin_unlock(tmpres-spinlock);
 +spin_unlock(dlm-spinlock);
 lockres could still get added to purgelist at this point and we
 could still have the same problem? I think, here we need some
 mechanism that marks the lockres is in use that would protect it
 from adding to the purgelist?

Seems there shouldn't be the possibility that the lockres was moved to purge
list again. We are in the case of a create lock, there should not be any
outstanding convert/unlock request in progress concurrently. If there is one,
the there must be lock(s) attached to the lockres. But if there are locks
attached, the lockres can't be purged. Also for concurrent create lock, there
is no problem either because there must be a lock attached to the lockres. 

I think your patch works too. But changing
DLM_LOCK_RES_IN_USE
to
DLM_LOCK_RES_CREATING_LOCK
can help understand better.

I also thought about inflight_locks, but so far I failed to find correct
places to drop it.

thanks,
wengang.
  if (res)
  dlm_lockres_put(res);
  res = tmpres;

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Srinivas Eeda

 +   spin_unlock(tmpres-spinlock);
 +   spin_unlock(dlm-spinlock);
 lockres could still get added to purgelist at this point and we
 could still have the same problem? I think, here we need some
 mechanism that marks the lockres is in use that would protect it
 from adding to the purgelist?
 Seems there shouldn't be the possibility that the lockres was moved to purge
 list again.
I think you are right about that once removed from the purge list it 
will not be put back again on the purgelist in this case :)

I still think that there is a hole here. Is it possible that the 
spinlocks are released here and right then dlm_thread was walking 
through dirty list and then it could put this on purge list for the 
first time?

If it's already on purgelist, then we seem to be safe by removing it. 
But what is preventing the lockres getting onto to purge list right at 
this point. locks are added to lockres a little later but till then it 
is not safe?(I think).

   We are in the case of a create lock, there should not be any
 outstanding convert/unlock request in progress concurrently. If there is one,
 the there must be lock(s) attached to the lockres. But if there are locks
 attached, the lockres can't be purged. Also for concurrent create lock, 
 there
 is no problem either because there must be a lock attached to the lockres.

 I think your patch works too. But changing
 DLM_LOCK_RES_IN_USE
 to
 DLM_LOCK_RES_CREATING_LOCK
 can help understand better.
I am ok with the name :). The thing we should do here is once we found a 
lockres and moving ahead with create lock request, we should be sure 
it's protected before releasing the spinlocks.
 I also thought about inflight_locks, but so far I failed to find correct
 places to drop it.

 thanks,
 wengang.
 if (res)
 dlm_lockres_put(res);
 res = tmpres;

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Wengang Wang
Srini,
Yes, you are right! Agree!

Sunil,
My privious patch which checks if lockres is unhashed can't fix the
problem that can't find lockres in ast handler. So I vote for Srini's
patch.
what's your comment?

thanks,
wengang.
On 11-06-09 00:43, Srinivas Eeda wrote:
 
 +  spin_unlock(tmpres-spinlock);
 +  spin_unlock(dlm-spinlock);
 lockres could still get added to purgelist at this point and we
 could still have the same problem? I think, here we need some
 mechanism that marks the lockres is in use that would protect it
 from adding to the purgelist?
 Seems there shouldn't be the possibility that the lockres was moved to purge
 list again.
 I think you are right about that once removed from the purge list it
 will not be put back again on the purgelist in this case :)
 
 I still think that there is a hole here. Is it possible that the
 spinlocks are released here and right then dlm_thread was walking
 through dirty list and then it could put this on purge list for the
 first time?
 
 If it's already on purgelist, then we seem to be safe by removing
 it. But what is preventing the lockres getting onto to purge list
 right at this point. locks are added to lockres a little later but
 till then it is not safe?(I think).
 
   We are in the case of a create lock, there should not be any
 outstanding convert/unlock request in progress concurrently. If there is one,
 the there must be lock(s) attached to the lockres. But if there are locks
 attached, the lockres can't be purged. Also for concurrent create lock, 
 there
 is no problem either because there must be a lock attached to the lockres.
 
 I think your patch works too. But changing
 DLM_LOCK_RES_IN_USE
 to
 DLM_LOCK_RES_CREATING_LOCK
 can help understand better.
 I am ok with the name :). The thing we should do here is once we
 found a lockres and moving ahead with create lock request, we should
 be sure it's protected before releasing the spinlocks.
 I also thought about inflight_locks, but so far I failed to find correct
 places to drop it.
 
 thanks,
 wengang.
if (res)
dlm_lockres_put(res);
res = tmpres;

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Sunil Mushran
On 06/08/2011 03:04 AM, Wengang Wang wrote:
 When we are to purge a lockres, we check if it's unused.
 the check includes
 1. no locks attached to the lockres.
 2. not dirty.
 3. not in recovery.
 4. no interested nodes in refmap.
 If the the above four are satisfied, we are going to purge it(remove it
 from hash table).

 While, when a create lock is in progress especially when lockres is owned
 remotely(spinlocks are dropped when networking), the lockres can satisfy above
 four condition. If it's put to purge list, it can be purged out from hash 
 table
 when we are still accessing on it(sending request to owner for example). 
 That's
 not what we want.

 I have met such a problem (orabug 12405575).
 The lockres is in the purge list already(there is a delay for real purge work)
 and the create lock request comes. When we are sending network message to the
 owner in dlmlock_remote(), the owner crashed. So we get DLM_RECOVERING as 
 return
 value and retries dlmlock_remote(). And before the owner crash, we have purged
 the lockres. So the lockres become stale(on lockres-onwer). Thus the code 
 calls
 dlmlock_remote() infinitely.

 fix:
 we remove the lockres from purge list if it's there in dlm_get_lock_resource()
 which is called for only createlock case. So that the lockres can't be purged
 when we are in progress of createlock.

 Signed-off-by: Wengang Wangwen.gang.w...@oracle.com
 ---
   fs/ocfs2/dlm/dlmmaster.c |   41 +
   1 files changed, 29 insertions(+), 12 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 11eefb8..511d43c 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -709,28 +709,27 @@ lookup:
   spin_lock(dlm-spinlock);
   tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
   if (tmpres) {
 - int dropping_ref = 0;
 -
 - spin_unlock(dlm-spinlock);
 -
   spin_lock(tmpres-spinlock);
   /* We wait for the other thread that is mastering the resource 
 */
   if (tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN) {
 + spin_unlock(dlm-spinlock);
   __dlm_wait_on_lockres(tmpres);
   BUG_ON(tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN);
 + spin_unlock(tmpres-spinlock);
 + dlm_lockres_put(tmpres);
 + tmpres = NULL;
 + goto lookup;
   }

   if (tmpres-owner == dlm-node_num) {
   BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
   dlm_lockres_grab_inflight_ref(dlm, tmpres);
 - } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
 - dropping_ref = 1;
 - spin_unlock(tmpres-spinlock);
 -
 - /* wait until done messaging the master, drop our ref to allow
 -  * the lockres to be purged, start over. */
 - if (dropping_ref) {
 - spin_lock(tmpres-spinlock);
 + } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
 + /*
 +  * wait until done messaging the master, drop our ref to
 +  * allow the lockres to be purged, start over.
 +  */
 + spin_unlock(dlm-spinlock);
   __dlm_wait_on_lockres_flags(tmpres, 
 DLM_LOCK_RES_DROPPING_REF);
   spin_unlock(tmpres-spinlock);
   dlm_lockres_put(tmpres);
 @@ -739,6 +738,24 @@ lookup:
   }

   mlog(0, found in hash!\n);
 + /*
 +  * we are going to do a create-lock next. so remove the lockres
 +  * from purge list to avoid the case that we will access on the
 +  * lockres which is already purged out from hash table in
 +  * dlm_run_purge_list() path.
 +  * otherwise, we could run into a problem:
 +  * the owner dead(recovery didn't take care of this lockres
 +  * since it's not in hashtable), and the code keeps sending
 +  * request to the dead node and getting DLM_RECOVERING and
 +  * then retrying infinitely.
 +  */
 + if (!list_empty(tmpres-purge)) {
 + list_del_init(tmpres-purge);
 + dlm_lockres_put(tmpres);
 + }
 + 
 + spin_unlock(tmpres-spinlock);
 + spin_unlock(dlm-spinlock);
   if (res)
   dlm_lockres_put(res);
   res = tmpres;

In short, you are holding onto the dlm-spinlock a bit longer and forcibly
removing the lockres from the purgelist.

I have two problems with this patch.

Firstly it ignores the fact that the resource can be added to the purgelist
right after we drop the dlm-spinlock. There is nothing to protect against
that. And 

Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Sunil Mushran
On 06/09/2011 06:01 PM, Wengang Wang wrote:

 Yes, you are right. There is such a problem that the lockres can be
 added to purge list after we drop dlm-spinlock. I am not sure if it's the 
 more
 likely case since there is a 8 seconds delay between the the time the lockres 
 is
 added to purge list and the time it is purged. I guess in normal case, the 
 create-
 lock should already finished in the 8 seconds.

 I considered about the inflight_locks. But I didn't think it well so far.
 Because simply moving the lockres out from purge list is not good
 enough, I will take a good think about making use of inflight_locks.

 Secondly, we currently manipulate the purgelist in one function only.
 __dlm_calc_lockres_usage(). We should stick to that.
 If we make good use of inflight_locks, I think we can.
 BTW, how are you testing this?
 I tested with the steps which can cause the original problem to prove it
 works.(but without UT).
 Also I made a regression test with the existing ocfs2-test suites(the
 multiple-xxx with reducing interations).

What's UT?

I am looking for is a testcase that reliably reproduces the problem.
Not necessarily that can be checked in. But something that triggers
the issue.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel