Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Junxiao, thanks for looking into this issue. Please see my comment below.

On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>> Different issues.
>>
>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
>>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>> This bug was detected during code audit. Never seen a crash. If it
>>>> does hit, then we have bigger problems. So no point posting to stable.
>
> I have read a lot of dlm recovery code recently, and I found this bug can
> happen in the following scenario:
>
> node 1 (migrate source):              node x (migrate target):
>
> dlm_unregister_domain()
>   dlm_migrate_all_locks()
>     dlm_empty_lockres()
>       /* select node x as the migrate
>        * target, since there is a node x
>        * lock on the granted list */
>     dlm_migrate_lockres()
>       dlm_mark_lockres_migrating() {
>         wait_event(dlm->ast_wq,
>             !dlm_lockres_is_dirty(dlm, res));
>                                       /* a node x unlock may happen here,
>                                        * so res->granted can become empty */

If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
dlm_lockres_release_ast() would make the dlm_unlock handler return
DLM_MIGRATING to the caller (in this case node x). So I don't see how the
granted_list could have a stale lock. Am I missing something? I do think the
race you point out below exists, but I am not sure it is due to the race
described above.

>         dlm_lockres_release_ast(dlm, res);
>       }
>       dlm_send_one_lockres()
>                                       dlm_process_recovery_data() {
>                                         /* tmpq is the res->granted list
>                                          * and is empty */
>                                         list_for_each_entry(lock, tmpq, list) {
>                                           if (lock->ml.cookie != ml->cookie)
>                                             lock = NULL;
>                                           else
>                                             break;
>                                         }
>                                         /* lock is invalid here */
>                                         if (lock->ml.node != ml->node)
>                                           BUG();  <-- crash here
>                                       }
>
> Thanks,
> Junxiao.
>
> Our customer can reproduce it. Also I saw you were assigned a similar bug
> before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220; is it
> the same bug?
> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
>> Hi Sunil,
>>
>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
>>>> In the target node of the dlm lock migration, the logic to find the
>>>> local dlm lock is wrong, it shouldn't change the loop variable lock in
>>>> the list_for_each_entry loop. This will cause a NULL-pointer accessing
>>>> crash.
>>>>
>>>> Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
>>>> Cc: sta...@vger.kernel.org
>>>> ---
>>>>  fs/ocfs2/dlm/dlmrecovery.c | 12 +++-
>>>>  1 file changed, 7 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>> index 01ebfd0..0b9cc88 100644
>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>  	u8 from = O2NM_MAX_NODES;
>>>>  	unsigned int added = 0;
>>>>  	__be64 c;
>>>> +	int found;
>>>>
>>>>  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
>>>>  	for (i = 0; i < mres->num_locks; i++) {
>>>> @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>  			/* MIGRATION ONLY! */
>>>>  			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>>>>
>>>> +			found = 0;
>>>>  			spin_lock(&res->spinlock);
>>>>  			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
>>>>  				tmpq = dlm_list_idx_to_ptr(res, j);
>>>>  				list_for_each_entry(lock, tmpq, list) {
>>>> -					if (lock->ml.cookie != ml->cookie)
>>>> -						lock = NULL;
>>>> -					else
>>>> +					if (lock->ml.cookie == ml->cookie) {
>>>> +						found = 1;
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below.
> [...]
> If the unlock request got sent at this point, and if the request was
> *processed*, the lock must have been removed from the granted_list. If
> the request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING
> flag set in dlm_lockres_release_ast() would make the dlm_unlock handler
> return DLM_MIGRATING to the caller (in this case node x). So I don't see
> how the granted_list could have a stale lock. Am I missing something?

I agree the granted_list will not have a stale lock. The issue is triggered
when there are no locks at all in the granted_list. On the migrate target
node the granted_list is also empty after the unlock. Then, due to the wrong
use of list_for_each_entry in the following code, "lock" is left non-NULL
even though the granted_list is empty. That pointer is invalid, so
lock->ml.node != ml->node may be true and trigger the BUG.
	for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
		tmpq = dlm_list_idx_to_ptr(res, j);
		list_for_each_entry(lock, tmpq, list) {
			if (lock->ml.cookie != ml->cookie)
				lock = NULL;
			else
				break;
		}
		if (lock)
			break;
	}

	/* lock is always created locally first, and
	 * destroyed locally last. it must be on the list */
	if (!lock) {
		c = ml->cookie;
		BUG();
	}

	if (lock->ml.node != ml->node) {
		c = lock->ml.cookie;
		c = ml->cookie;
		BUG();
	}

Thanks,
Junxiao.

> I do think the race you point out below exists, but I am not sure it is
> due to the race described above.
> [...]
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below.
> [...]
> If the unlock request got sent at this point, and if the request was
> *processed*, the lock must have been removed from the granted_list. If
> the request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING
> flag set in dlm_lockres_release_ast() would make the dlm_unlock handler
> return DLM_MIGRATING to the caller (in this case node x). So I don't see
> how the granted_list could have a stale lock. Am I missing something?
> I do think the race you point out below exists, but I am not sure it is
> due to the race described above.

Outside the window from setting the RES_BLOCK_DIRTY flag and calling
wait_event() up to dlm_lockres_release_ast(), the granted_list cannot be
empty: wait_event() waits until dlm_thread clears the dirty flag, and the
list shuffle there will move another lock onto the granted list. After the
window, the DLM_MIGRATING flag stops other nodes' unlocks against the
granted list. So I think an unlock landing inside this window is what
empties the granted list and panics the migrate target.
I didn't see any other harm from this, since the migrate target node will
shuffle the list and send the AST message later.

Thanks,
Junxiao.

> [...]
[Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
In the target node of the dlm lock migration, the logic to find the local
dlm lock is wrong, it shouldn't change the loop variable lock in the
list_for_each_entry loop. This will cause a NULL-pointer accessing crash.

Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 	u8 from = O2NM_MAX_NODES;
 	unsigned int added = 0;
 	__be64 c;
+	int found;
 
 	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
 	for (i = 0; i < mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 			/* MIGRATION ONLY! */
 			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
 
+			found = 0;
 			spin_lock(&res->spinlock);
 			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
 				tmpq = dlm_list_idx_to_ptr(res, j);
 				list_for_each_entry(lock, tmpq, list) {
-					if (lock->ml.cookie != ml->cookie)
-						lock = NULL;
-					else
+					if (lock->ml.cookie == ml->cookie) {
+						found = 1;
 						break;
+					}
 				}
-				if (lock)
+				if (found)
 					break;
 			}
 
 			/* lock is always created locally first, and
 			 * destroyed locally last. it must be on the list */
-			if (!lock) {
+			if (!found) {
 				c = ml->cookie;
 				mlog(ML_ERROR, "Could not find local lock "
 					       "with cookie %u:%llu, node %u,
--
1.7.9.5

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic to find the local
> dlm lock is wrong, it shouldn't change the loop variable lock in the
> list_for_each_entry loop. This will cause a NULL-pointer accessing crash.
> [...]

https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47

We had decided to go back to list_for_each().
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
>> In the target node of the dlm lock migration, the logic to find the
>> local dlm lock is wrong, it shouldn't change the loop variable lock in
>> the list_for_each_entry loop. This will cause a NULL-pointer accessing
>> crash.
>> [...]
>
> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>
> We had decided to go back to list_for_each().

OK, thank you. It's fine to revert, since the conversion introduced a bug.
But I think you'd better cc the stable branch.