Re: [Ocfs2-devel] [PATCH] ocfs2: Fix quota file corruption
On Thu, Feb 20, 2014 at 12:39:59PM +0100, Jan Kara wrote:
> Global quota files are accessed from different nodes. Thus we cannot cache
> the offset of a quota structure in the quota file after we drop our node
> reference count to it, because after that moment the quota structure may be
> freed and reallocated elsewhere by a different node, resulting in corruption
> of the quota file. Fix the problem by clearing dq_off when we are releasing
> the dquot structure. We also remove the DQ_READ_B handling because it is
> useless - DQ_ACTIVE_B is set iff DQ_READ_B is set.
>
> CC: sta...@vger.kernel.org
> CC: Goldwyn Rodrigues <rgold...@suse.de>
> CC: Mark Fasheh <mfas...@suse.de>
> Signed-off-by: Jan Kara <j...@suse.cz>

Thanks Jan, this looks good.

Reviewed-by: Mark Fasheh <mfas...@suse.de>
	--Mark

--
Mark Fasheh
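The mechanism behind Jan's fix, in plain terms: a dquot caches the offset of
its slot in the shared (global) quota file, and once the local reference is
dropped another node may free that slot and hand it to a different id, so the
cached offset must be forgotten on release and looked up again on the next
use. Below is a minimal userspace sketch of that idea only; every name in it
(toy_dquot, toy_acquire, toy_release, lookup_slot) is invented for
illustration, and none of this is the actual ocfs2 code.

#include <stdio.h>

/* Illustrative model only -- names and layout are invented, not ocfs2's. */
struct toy_dquot {
	long dq_off;   /* cached offset of this id's slot in the shared file */
	int  dq_id;
};

/* Pretend lookup of the id's current slot in the global file (it may move). */
static long lookup_slot(int id)
{
	return 128 + 64 * id;   /* placeholder arithmetic */
}

static void toy_acquire(struct toy_dquot *dq)
{
	if (dq->dq_off == 0)           /* no valid cached offset: look it up */
		dq->dq_off = lookup_slot(dq->dq_id);
}

static void toy_release(struct toy_dquot *dq)
{
	/*
	 * Once we drop our reference, another node may free this slot and
	 * reallocate it for a different id.  Keeping dq_off cached would let
	 * a later write land in someone else's slot, so forget it here.
	 */
	dq->dq_off = 0;
}

int main(void)
{
	struct toy_dquot dq = { .dq_off = 0, .dq_id = 3 };

	toy_acquire(&dq);
	printf("using offset %ld\n", dq.dq_off);
	toy_release(&dq);
	printf("cached offset after release: %ld\n", dq.dq_off);
	return 0;
}

The only point of the sketch is that the offset is re-resolved after every
release rather than trusted across it.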
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Junxiao, thanks for looking into this issue. Please see my comments below.

On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
> This bug was detected during code audit. Never seen a crash. If it does hit,
> then we have bigger problems. So no point posting to stable.
>
> I read a lot of dlm recovery code recently, and I found this bug can happen
> in the following scenario:
>
> node 1:                                   migrate target node x:
>
> dlm_unregister_domain()
>   dlm_migrate_all_locks()
>     dlm_empty_lockres()
>       select node x as the migrate target,
>       since there is a node x lock on the
>       granted list
>     dlm_migrate_lockres()
>       dlm_mark_lockres_migrating() {
>         wait_event(dlm->ast_wq,
>                    !dlm_lockres_is_dirty(dlm, res));
>
>                                           node x unlock may happen here;
>                                           res->granted list can be empty

If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
dlm_lockres_release_ast would make the dlm_unlock handler return DLM_MIGRATING
to the caller (in this case node x). So I don't see how the granted_list could
hold a stale lock. Am I missing something?

I do think the race you point out below exists, but I am not sure it is due to
the race described above.

>         dlm_lockres_release_ast(dlm, res);
>       }
>       dlm_send_one_lockres()
>
>                                           dlm_process_recovery_data() {
>                                             tmpq is the res->granted list
>                                             and is empty
>                                             list_for_each_entry(lock, tmpq, list) {
>                                               if (lock->ml.cookie != ml->cookie)
>                                                 lock = NULL;
>                                               else
>                                                 break;
>                                             }
>                                             lock will be invalid here
>                                             if (lock->ml.node != ml->node)
>                                               BUG();  -- crash here
>                                           }
>
> Thanks,
> Junxiao.
>
> Our customer can reproduce it. Also I saw you were assigned a similar bug
> before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220 - is it the
> same bug?
>
> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> Hi Sunil,
>
> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic to find the local
> dlm lock is wrong: it shouldn't change the loop variable "lock" in the
> list_for_each_entry loop. This will cause a NULL-pointer accessing crash.
>
> Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
> Cc: sta...@vger.kernel.org
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..0b9cc88 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>  	u8 from = O2NM_MAX_NODES;
>  	unsigned int added = 0;
>  	__be64 c;
> +	int found;
>
>  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
>  	for (i=0; i<mres->num_locks; i++) {
> @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>  			/* MIGRATION ONLY! */
>  			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>
> +			found = 0;
>  			spin_lock(&res->spinlock);
>  			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
>  				tmpq = dlm_list_idx_to_ptr(res, j);
>  				list_for_each_entry(lock, tmpq, list) {
> -					if (lock->ml.cookie != ml->cookie)
> -						lock = NULL;
> -					else
> +					if (lock->ml.cookie == ml->cookie) {
> +						found = 1;
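The crash pattern discussed in this thread comes down to a well-known property
of list_for_each_entry: the cursor is never NULL when the loop ends. On an
empty list (or when nothing matches), it ends up as container_of() of the list
head itself, so testing or dereferencing the cursor afterwards reads garbage.
The following self-contained userspace sketch re-implements just enough of the
list macros to show the broken sentinel pattern and the found-flag fix; the
types and names (toy_lock, granted, wanted) are invented for illustration, and
this is not the kernel code. It needs gcc or clang, since it uses the typeof
extension the way the kernel does.

#include <stddef.h>
#include <stdio.h>

/* Minimal re-implementation of the kernel's intrusive list, for illustration. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, typeof(*pos), member))

struct toy_lock {
	unsigned long long cookie;
	struct list_head   list;
};

int main(void)
{
	struct list_head granted = LIST_HEAD_INIT(granted);   /* empty list */
	struct toy_lock *lock = NULL;
	unsigned long long wanted = 42;
	int found = 0;

	/*
	 * Broken pattern: after walking an empty list, 'lock' is
	 * container_of(&granted, ...), i.e. a bogus pointer near the stack
	 * variable, not NULL -- so "if (lock) ..." does not detect the miss,
	 * and dereferencing lock->cookie would read garbage.
	 */
	list_for_each_entry(lock, &granted, list)
		if (lock->cookie == wanted)
			break;
	printf("after empty-list walk, lock = %p (not NULL!)\n", (void *)lock);

	/* Fixed pattern: track the hit explicitly instead of testing 'lock'. */
	list_for_each_entry(lock, &granted, list)
		if (lock->cookie == wanted) {
			found = 1;
			break;
		}
	printf("found = %d\n", found);
	return 0;
}

This is the same reasoning that leads the patch above to introduce an explicit
"found" variable instead of testing "lock" after the loop.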
Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: Remove OCFS2_INODE_SKIP_DELETE flag
On Fri, Feb 21, 2014 at 10:44:59AM +0100, Jan Kara wrote:
> The flag was never set, delete it.
>
> Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>
> Signed-off-by: Jan Kara <j...@suse.cz>

Ok, that was easy :)

Reviewed-by: Mark Fasheh <mfas...@suse.de>
	--Mark

--
Mark Fasheh
Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: Move dquot_initialize() in ocfs2_delete_inode() somewhat later
On Fri, Feb 21, 2014 at 10:45:00AM +0100, Jan Kara wrote:
> Move the dquot_initialize() call in ocfs2_delete_inode() after the moment we
> verify the inode is actually a sane one to delete. We certainly don't want to
> initialize quota for system inodes etc. This also avoids calling into quota
> code from the downconvert thread. Add more detail to the comment explaining
> why bailing out from ocfs2_delete_inode() when we are in the downconvert
> thread is OK.
>
> Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>
> Signed-off-by: Jan Kara <j...@suse.cz>
> ---
>  fs/ocfs2/inode.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)

Reviewed-by: Mark Fasheh <mfas...@suse.de>

--
Mark Fasheh
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comments below.
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
> This bug was detected during code audit. Never seen a crash. If it does hit,
> then we have bigger problems. So no point posting to stable.
>
> I read a lot of dlm recovery code recently, and I found this bug can happen
> in the following scenario:
>
> node 1:                                   migrate target node x:
>
> dlm_unregister_domain()
>   dlm_migrate_all_locks()
>     dlm_empty_lockres()
>       select node x as the migrate target,
>       since there is a node x lock on the
>       granted list
>     dlm_migrate_lockres()
>       dlm_mark_lockres_migrating() {
>         wait_event(dlm->ast_wq,
>                    !dlm_lockres_is_dirty(dlm, res));
>
>                                           node x unlock may happen here;
>                                           res->granted list can be empty
>
> If the unlock request got sent at this point, and if the request was
> *processed*, the lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
> dlm_lockres_release_ast would make the dlm_unlock handler return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how the
> granted_list could hold a stale lock. Am I missing something?

I agree the granted_list will not have a stale lock. The issue is triggered
when there are no locks at all in the granted_list. On the migrate target
node, the granted_list is also empty after the unlock. Then, due to the wrong
use of list_for_each_entry in the following code, lock will not be NULL even
though the granted_list is empty. The lock is invalid, and
lock->ml.node != ml->node may be true and trigger the BUG.

	for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
		tmpq = dlm_list_idx_to_ptr(res, j);
		list_for_each_entry(lock, tmpq, list) {
			if (lock->ml.cookie != ml->cookie)
				lock = NULL;
			else
				break;
		}
		if (lock)
			break;
	}

	/* lock is always created locally first, and
	 * destroyed locally last.  it must be on the list */
	if (!lock) {
		c = ml->cookie;
		BUG();
	}

	if (lock->ml.node != ml->node) {
		c = lock->ml.cookie;
		c = ml->cookie;
		BUG();
	}

Thanks,
Junxiao.

> I do think the race you point out below exists, but I am not sure it is due
> to the race described above.
>
>         dlm_lockres_release_ast(dlm, res);
>       }
>       dlm_send_one_lockres()
>
>                                           dlm_process_recovery_data() {
>                                             tmpq is the res->granted list
>                                             and is empty
>                                             list_for_each_entry(lock, tmpq, list) {
>                                               if (lock->ml.cookie != ml->cookie)
>                                                 lock = NULL;
>                                               else
>                                                 break;
>                                             }
>                                             lock will be invalid here
>                                             if (lock->ml.node != ml->node)
>                                               BUG();  -- crash here
>                                           }
>
> Thanks,
> Junxiao.
>
> Our customer can reproduce it. Also I saw you were assigned a similar bug
> before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220 - is it the
> same bug?
>
> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> Hi Sunil,
>
> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic to find the local
> dlm lock is wrong: it shouldn't change the loop variable "lock" in the
> list_for_each_entry loop. This will cause a NULL-pointer
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comments below.
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
> This bug was detected during code audit. Never seen a crash. If it does hit,
> then we have bigger problems. So no point posting to stable.
>
> I read a lot of dlm recovery code recently, and I found this bug can happen
> in the following scenario:
>
> node 1:                                   migrate target node x:
>
> dlm_unregister_domain()
>   dlm_migrate_all_locks()
>     dlm_empty_lockres()
>       select node x as the migrate target,
>       since there is a node x lock on the
>       granted list
>     dlm_migrate_lockres()
>       dlm_mark_lockres_migrating() {
>         wait_event(dlm->ast_wq,
>                    !dlm_lockres_is_dirty(dlm, res));
>
>                                           node x unlock may happen here;
>                                           res->granted list can be empty
>
> If the unlock request got sent at this point, and if the request was
> *processed*, the lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
> dlm_lockres_release_ast would make the dlm_unlock handler return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how the
> granted_list could hold a stale lock. Am I missing something?
>
> I do think the race you point out below exists, but I am not sure it is due
> to the race described above.

Outside the window from setting the RES_BLOCK_DIRTY flag and the wait_event()
to dlm_lockres_release_ast(), the granted_list cannot be empty, since
wait_event() waits until dlm_thread clears the dirty flag, and the list
shuffle done there will move another lock onto the granted list. After the
window, the DLM_LOCK_RES_MIGRATING flag stops other nodes' unlocks from
touching the granted list. So I think this window is what causes the empty
granted list and the migrate target panic. I don't see any other harm from
it, since the migrate target node will shuffle the list and send the AST
message later.

Thanks,
Junxiao.

>         dlm_lockres_release_ast(dlm, res);
>       }
>       dlm_send_one_lockres()
>
>                                           dlm_process_recovery_data() {
>                                             tmpq is the res->granted list
>                                             and is empty
>                                             list_for_each_entry(lock, tmpq, list) {
>                                               if (lock->ml.cookie != ml->cookie)
>                                                 lock = NULL;
>                                               else
>                                                 break;
>                                             }
>                                             lock will be invalid here
>                                             if (lock->ml.node != ml->node)
>                                               BUG();  -- crash here
>                                           }
>
> Thanks,
> Junxiao.
>
> Our customer can reproduce it. Also I saw you were assigned a similar bug
> before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220 - is it the
> same bug?
>
> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi <junxiao...@oracle.com> wrote:
> Hi Sunil,
>
> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao...@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic to find the local
> dlm lock is wrong: it shouldn't change the loop variable "lock" in the
> list_for_each_entry loop. This will cause a NULL-pointer accessing crash.
> Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
> Cc: sta...@vger.kernel.org
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..0b9cc88 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>  	u8 from = O2NM_MAX_NODES;
>  	unsigned int added = 0;
>  	__be64 c;
> +	int found;
>
>  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
>  	for (i=0; i<mres->num_locks; i++) {
> @@ -1793,22 +1794,23 @@ static int
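As a footnote to the window Srinivas and Junxiao discuss above: once the
migrating flag is set on the lock resource, a remote unlock can no longer
empty the granted list, because the unlock handler bails out with a migrating
status and the caller has to retry after migration completes. Below is a toy
single-threaded model of that guard only; the names (toy_res, toy_unlock,
TOY_RES_MIGRATING) are invented, and the real o2dlm unlock path is of course
far more involved.

#include <stdio.h>

enum { TOY_OK, TOY_MIGRATING };

struct toy_res {
	unsigned flags;
	int granted_count;   /* stand-in for the granted list */
};
#define TOY_RES_MIGRATING 0x1

/* Model of the unlock handler: refuse to touch the lists mid-migration. */
static int toy_unlock(struct toy_res *res)
{
	if (res->flags & TOY_RES_MIGRATING)
		return TOY_MIGRATING;   /* caller must retry after migration */
	res->granted_count--;           /* normal path: drop from granted list */
	return TOY_OK;
}

int main(void)
{
	struct toy_res res = { .flags = 0, .granted_count = 1 };

	/* Before the flag is set, an unlock can empty the granted list. */
	printf("unlock before flag: %d, granted_count=%d\n",
	       toy_unlock(&res), res.granted_count);

	/* After the migrating flag is set, it cannot. */
	res.granted_count = 1;
	res.flags |= TOY_RES_MIGRATING;
	printf("unlock after flag:  %d, granted_count=%d\n",
	       toy_unlock(&res), res.granted_count);
	return 0;
}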