Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Srinivas Eeda

Junxiao, thanks for looking into this issue. Please see my comment below

On 02/24/2014 01:07 AM, Junxiao Bi wrote:

Hi,

On 07/19/2012 09:59 AM, Sunil Mushran wrote:

Different issues.

On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:


On 07/19/2012 12:36 AM, Sunil Mushran wrote:

This bug was detected during code audit. Never seen a crash. If it does
hit, then we have bigger problems. So no point posting to stable.


I have read a lot of dlm recovery code recently, and I found this bug
could happen in the following scenario.


node 1:                                  migrate target node x:
dlm_unregister_domain()
 dlm_migrate_all_locks()
  dlm_empty_lockres()
   select node x as migrate target node
   since there is a node x lock on the granted list.
   dlm_migrate_lockres()
    dlm_mark_lockres_migrating() {
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
                                         node x unlock may happen here,
                                         res->granted list can be empty.
If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
dlm_lockres_release_ast would make the dlm_unlock handler return
DLM_MIGRATING to the caller (in this case node x). So I don't see how the
granted_list could have a stale lock. Am I missing something?


I do think the race you point out below exists, but I am not sure it was
due to the race described above.



    dlm_lockres_release_ast(dlm, res);
   }
   dlm_send_one_lockres()
                             dlm_process_recovery_data() {
                              tmpq is res->granted list and is empty.
                              list_for_each_entry(lock, tmpq, list) {
                               if (lock->ml.cookie != ml->cookie)
                                lock = NULL;
                               else
                                break;
                              }
                              lock will be invalid here.
                              if (lock->ml.node != ml->node)
                               BUG();   <-- crash here.
                             }

Thanks,
Junxiao.


Our customer can reproduce it. Also, I saw you were assigned a
similar bug before (see
https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220); is it the
same BUG?


On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:

On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

In the target node of the dlm lock migration, the logic to find
the local dlm lock is wrong: it shouldn't change the loop variable
lock in the list_for_each_entry loop. This will cause a NULL-pointer
dereference crash.

Signed-off-by: Junxiao Bi junxiao...@oracle.com
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 	u8 from = O2NM_MAX_NODES;
 	unsigned int added = 0;
 	__be64 c;
+	int found;

 	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
 	for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 			/* MIGRATION ONLY! */
 			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));

+			found = 0;
 			spin_lock(&res->spinlock);
 			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
 				tmpq = dlm_list_idx_to_ptr(res, j);
 				list_for_each_entry(lock, tmpq, list) {
-					if (lock->ml.cookie != ml->cookie)
-						lock = NULL;
-					else
+					if (lock->ml.cookie == ml->cookie) {
+						found = 1;
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Junxiao Bi
Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
 Junxiao, thanks for looking into this issue. Please see my comment below

 On 02/24/2014 01:07 AM, Junxiao Bi wrote:
 Hi,

 On 07/19/2012 09:59 AM, Sunil Mushran wrote:
 Different issues.

 On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:

 On 07/19/2012 12:36 AM, Sunil Mushran wrote:
 This bug was detected during code audit. Never seen a crash. If
 it does hit,
 then we have bigger problems. So no point posting to stable.

 I read a lot of dlm recovery code recently, I found this bug could
 happen at the following scenario.

 node 1: migrate target
 node x:
 dlm_unregister_domain()
  dlm_migrate_all_locks()
   dlm_empty_lockres()
select node x as migrate target node
since there is a node x lock on the granted list.
dlm_migrate_lockres()
 dlm_mark_lockres_migrating() {
  wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
  node x unlock may happen here, res->granted list can be empty.
 If the unlock request got sent at this point, and if the request was
 *processed*, lock must have been removed from the granted_list. If the
 request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
 in dlm_lockres_release_ast would make dlm_unlock handler to return
 DLM_MIGRATING to the caller (in this case node x). So I don't see how
 granted_list could have stale lock. Am I missing something ?
I agree granted_list will not have a stale lock. The issue is triggered
when there are no locks in the granted_list. On the migrate target node,
the granted_list is also empty after the unlock. Then, due to the wrong
use of list_for_each_entry in the following code, lock will not be NULL
even though the granted_list is empty. The lock pointer is invalid, so
lock->ml.node != ml->node may be true and trigger the BUG.


for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
	tmpq = dlm_list_idx_to_ptr(res, j);
	list_for_each_entry(lock, tmpq, list) {
		if (lock->ml.cookie != ml->cookie)
			lock = NULL;
		else
			break;
	}
	if (lock)
		break;
}

/* lock is always created locally first, and
 * destroyed locally last.  it must be on the list */
if (!lock) {
	c = ml->cookie;
	BUG();
}

if (lock->ml.node != ml->node) {
	c = lock->ml.cookie;
	c = ml->cookie;
	BUG();
}

Thanks,
Junxiao.

 I do think the race you point out below exists, but I am not sure it
 was due to the race described above.

     dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                             dlm_process_recovery_data() {
                              tmpq is res->granted list and is empty.
                              list_for_each_entry(lock, tmpq, list) {
                               if (lock->ml.cookie != ml->cookie)
                                lock = NULL;
                               else
                                break;
                              }
                              lock will be invalid here.
                              if (lock->ml.node != ml->node)
                               BUG();   <-- crash here.
                             }

 Thanks,
 Junxiao.

 Our customer can reproduce it. Also I saw you were assigned a
 similar bug before, see
 https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
 same BUG?

 On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

 Hi Sunil,

 On 07/18/2012 03:49 AM, Sunil Mushran wrote:
 On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find
 the local dlm lock is wrong: it shouldn't change the loop variable
 lock in the list_for_each_entry loop. This will cause a NULL-pointer

Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Junxiao Bi
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
 Junxiao, thanks for looking into this issue. Please see my comment below

 On 02/24/2014 01:07 AM, Junxiao Bi wrote:
 Hi,

 On 07/19/2012 09:59 AM, Sunil Mushran wrote:
 Different issues.

 On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:

 On 07/19/2012 12:36 AM, Sunil Mushran wrote:
 This bug was detected during code audit. Never seen a crash. If
 it does hit,
 then we have bigger problems. So no point posting to stable.

 I read a lot of dlm recovery code recently, I found this bug could
 happen at the following scenario.

 node 1: migrate target
 node x:
 dlm_unregister_domain()
  dlm_migrate_all_locks()
   dlm_empty_lockres()
select node x as migrate target node
since there is a node x lock on the granted list.
dlm_migrate_lockres()
 dlm_mark_lockres_migrating() {
  wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
  node x unlock may happen here, res->granted list can be empty.
 If the unlock request got sent at this point, and if the request was
 *processed*, lock must have been removed from the granted_list. If the
 request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
 in dlm_lockres_release_ast would make dlm_unlock handler to return
 DLM_MIGRATING to the caller (in this case node x). So I don't see how
 granted_list could have stale lock. Am I missing something ?

 I do think there is such race that you pointed below exist, but I am
 not sure if it was due to the above race described.
Outside the window from setting the RES_BLOCK_DIRTY flag and wait_event()
to dlm_lockres_release_ast(), the granted_list cannot be empty, since
wait_event() waits until dlm_thread clears the dirty flag, where the list
shuffle will move another lock onto the granted list. After the window,
the DLM_MIGRATING flag stops other nodes from unlocking locks on the
granted list. So I think an unlock inside this window causes the empty
granted list and the panic on the migrate target. I didn't see any other
harm from this, since the migrate target node will shuffle the lists and
send the AST messages later.

Thanks,
Junxiao.

     dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                             dlm_process_recovery_data() {
                              tmpq is res->granted list and is empty.
                              list_for_each_entry(lock, tmpq, list) {
                               if (lock->ml.cookie != ml->cookie)
                                lock = NULL;
                               else
                                break;
                              }
                              lock will be invalid here.
                              if (lock->ml.node != ml->node)
                               BUG();   <-- crash here.
                             }

 Thanks,
 Junxiao.

 Our customer can reproduce it. Also I saw you were assigned a
 similar bug before, see
 https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
 same BUG?

 On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

 Hi Sunil,

 On 07/18/2012 03:49 AM, Sunil Mushran wrote:
 On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find
 the local dlm lock is wrong: it shouldn't change the loop variable
 lock in the list_for_each_entry loop. This will cause a NULL-pointer
 dereference crash.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Cc: sta...@vger.kernel.org
 ---
  fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
  1 file changed, 7 insertions(+), 5 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c
 b/fs/ocfs2/dlm/dlmrecovery.c
 index 01ebfd0..0b9cc88 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  	u8 from = O2NM_MAX_NODES;
  	unsigned int added = 0;
  	__be64 c;
 +	int found;

  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
  	for (i=0; i<mres->num_locks; i++) {
 @@ -1793,22 +1794,23 @@ static int

[Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2012-07-17 Thread Junxiao Bi
In the target node of the dlm lock migration, the logic to find
the local dlm lock is wrong: it shouldn't change the loop variable
lock in the list_for_each_entry loop. This will cause a NULL-pointer
dereference crash.

Signed-off-by: Junxiao Bi junxiao...@oracle.com
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 	u8 from = O2NM_MAX_NODES;
 	unsigned int added = 0;
 	__be64 c;
+	int found;

 	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
 	for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 			/* MIGRATION ONLY! */
 			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));

+			found = 0;
 			spin_lock(&res->spinlock);
 			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
 				tmpq = dlm_list_idx_to_ptr(res, j);
 				list_for_each_entry(lock, tmpq, list) {
-					if (lock->ml.cookie != ml->cookie)
-						lock = NULL;
-					else
+					if (lock->ml.cookie == ml->cookie) {
+						found = 1;
 						break;
+					}
 				}
-				if (lock)
+				if (found)
 					break;
 			}

 			/* lock is always created locally first, and
 			 * destroyed locally last.  it must be on the list */
-			if (!lock) {
+			if (!found) {
 				c = ml->cookie;
 				mlog(ML_ERROR, "Could not find local lock "
 					       "with cookie %u:%llu, node %u, 
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2012-07-17 Thread Sunil Mushran
On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find
 the local dlm lock is wrong: it shouldn't change the loop variable
 lock in the list_for_each_entry loop. This will cause a NULL-pointer
 dereference crash.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Cc: sta...@vger.kernel.org
 ---
  fs/ocfs2/dlm/dlmrecovery.c |   12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 01ebfd0..0b9cc88 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  	u8 from = O2NM_MAX_NODES;
  	unsigned int added = 0;
  	__be64 c;
 +	int found;

  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
  	for (i=0; i<mres->num_locks; i++) {
 @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  			/* MIGRATION ONLY! */
  			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));

 +			found = 0;
  			spin_lock(&res->spinlock);
  			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
  				tmpq = dlm_list_idx_to_ptr(res, j);
  				list_for_each_entry(lock, tmpq, list) {
 -					if (lock->ml.cookie != ml->cookie)
 -						lock = NULL;
 -					else
 +					if (lock->ml.cookie == ml->cookie) {
 +						found = 1;
  						break;
 +					}
  				}
 -				if (lock)
 +				if (found)
  					break;
  			}

  			/* lock is always created locally first, and
  			 * destroyed locally last.  it must be on the list */
 -			if (!lock) {
 +			if (!found) {
  				c = ml->cookie;
  				mlog(ML_ERROR, "Could not find local lock "
  					       "with cookie %u:%llu, node %u, 



https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47

We had decided to go back to list_for_each().

Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2012-07-17 Thread Junxiao Bi
Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:
 On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find
 the local dlm lock is wrong: it shouldn't change the loop variable
 lock in the list_for_each_entry loop. This will cause a NULL-pointer
 dereference crash.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Cc: sta...@vger.kernel.org
 ---
  fs/ocfs2/dlm/dlmrecovery.c |   12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 01ebfd0..0b9cc88 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  	u8 from = O2NM_MAX_NODES;
  	unsigned int added = 0;
  	__be64 c;
 +	int found;

  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
  	for (i=0; i<mres->num_locks; i++) {
 @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  			/* MIGRATION ONLY! */
  			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));

 +			found = 0;
  			spin_lock(&res->spinlock);
  			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
  				tmpq = dlm_list_idx_to_ptr(res, j);
  				list_for_each_entry(lock, tmpq, list) {
 -					if (lock->ml.cookie != ml->cookie)
 -						lock = NULL;
 -					else
 +					if (lock->ml.cookie == ml->cookie) {
 +						found = 1;
  						break;
 +					}
  				}
 -				if (lock)
 +				if (found)
  					break;
  			}

  			/* lock is always created locally first, and
  			 * destroyed locally last.  it must be on the list */
 -			if (!lock) {
 +			if (!found) {
  				c = ml->cookie;
  				mlog(ML_ERROR, "Could not find local lock "
  					       "with cookie %u:%llu, node %u, 



 https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47


 We had decided to go back to list_for_each().

OK, thank you. It's OK to revert it back for an introduced bug. But I
think you'd better cc the stable branch.