Re: [Ocfs2-devel] [PATCH] ocfs2: Fix quota file corruption

2014-02-24 Thread Mark Fasheh
On Thu, Feb 20, 2014 at 12:39:59PM +0100, Jan Kara wrote:
 Global quota files are accessed from different nodes. Thus we cannot
 cache the offset of a quota structure in the quota file after we drop our
 node reference count to it, because after that moment the quota structure
 may be freed and reallocated elsewhere by a different node, resulting in
 corruption of the quota file.
 
 Fix the problem by clearing dq_off when we are releasing the dquot
 structure. We also remove the DQ_READ_B handling because it is useless -
 DQ_ACTIVE_B is set iff DQ_READ_B is set.
 
 CC: sta...@vger.kernel.org
 CC: Goldwyn Rodrigues rgold...@suse.de
 CC: Mark Fasheh mfas...@suse.de
 Signed-off-by: Jan Kara j...@suse.cz
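
For readers following along, a toy userspace model of the idea (the toy_* names
are made up; this is not the ocfs2 code): the cached offset into the shared
quota file is wiped when the reference is released, so the next acquire
re-searches the file instead of trusting a location another node may have freed
and reused in the meantime.

#include <stdio.h>

struct toy_dquot {
	long dq_id;
	long dq_off;	/* cached offset in the shared quota file; 0 = unknown */
};

/* Stand-in for searching the global quota file for this id. */
static long quota_file_lookup(long id)
{
	return 128 + id * 64;	/* pretend on-disk location */
}

static void toy_acquire_dquot(struct toy_dquot *dq)
{
	if (dq->dq_off == 0)	/* no trusted cache: search the quota file */
		dq->dq_off = quota_file_lookup(dq->dq_id);
	printf("acquire id=%ld at offset %ld\n", dq->dq_id, dq->dq_off);
}

static void toy_release_dquot(struct toy_dquot *dq)
{
	/*
	 * The essence of the fix: once our node reference is dropped, another
	 * node may free and reallocate this structure elsewhere in the file,
	 * so the cached offset must not survive the release.
	 */
	dq->dq_off = 0;
}

int main(void)
{
	struct toy_dquot dq = { .dq_id = 3, .dq_off = 0 };

	toy_acquire_dquot(&dq);	/* first use: searches the quota file */
	toy_release_dquot(&dq);	/* cache dropped together with the reference */
	toy_acquire_dquot(&dq);	/* re-searches instead of reusing a stale offset */
	return 0;
}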

Thanks Jan, this looks good.

Reviewed-by: Mark Fasheh mfas...@suse.de
--Mark

--
Mark Fasheh

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Srinivas Eeda

Junxiao, thanks for looking into this issue. Please see my comment below

On 02/24/2014 01:07 AM, Junxiao Bi wrote:

Hi,

On 07/19/2012 09:59 AM, Sunil Mushran wrote:

Different issues.

On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:


On 07/19/2012 12:36 AM, Sunil Mushran wrote:

This bug was detected during code audit. Never seen a crash. If it does hit,
then we have bigger problems. So no point posting to stable.


I have read a lot of the dlm recovery code recently, and I found this bug
could happen in the following scenario.


node 1:                                   migrate target node x:
dlm_unregister_domain()
 dlm_migrate_all_locks()
  dlm_empty_lockres()
   select node x as migrate target node
   since there is a node x lock on the granted list.
   dlm_migrate_lockres()
    dlm_mark_lockres_migrating() {
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
     node x unlock may happen here, res->granted list can be empty.
If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
dlm_lockres_release_ast() would make the dlm_unlock handler return
DLM_MIGRATING to the caller (in this case node x). So I don't see how the
granted_list could have a stale lock. Am I missing something?


I do think the race you point out below exists, but I am not sure it is due
to the race described above.



     dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                                          dlm_process_recovery_data() {
                                           tmpq is res->granted list and is
                                           empty.
                                           list_for_each_entry(lock, tmpq, list) {
                                            if (lock->ml.cookie != ml->cookie)
                                             lock = NULL;
                                            else
                                             break;
                                           }
                                           lock will be invalid here.
                                           if (lock->ml.node != ml->node)
                                            BUG() <-- crash here.
                                          }

Thanks,
Junxiao.


Our customer can reproduce it. Also I saw you were assigned a
similar bug before, see
https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
same BUG?


On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:

On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

In the target node of the dlm lock migration, the logic to find
the local dlm lock is wrong: it shouldn't change the loop variable
lock in the list_for_each_entry loop. This will cause a NULL-pointer
access crash.

Signed-off-by: Junxiao Bi junxiao...@oracle.com
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 	u8 from = O2NM_MAX_NODES;
 	unsigned int added = 0;
 	__be64 c;
+	int found;
 
 	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
 	for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 			/* MIGRATION ONLY! */
 			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
 
+			found = 0;
 			spin_lock(&res->spinlock);
 			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
 				tmpq = dlm_list_idx_to_ptr(res, j);
 				list_for_each_entry(lock, tmpq, list) {
-					if (lock->ml.cookie != ml->cookie)
-						lock = NULL;
-					else
+					if (lock->ml.cookie == ml->cookie) {
+						found = 1;

Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: Remove OCFS2_INODE_SKIP_DELETE flag

2014-02-24 Thread Mark Fasheh
On Fri, Feb 21, 2014 at 10:44:59AM +0100, Jan Kara wrote:
 The flag was never set, delete it.
 
 Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com
 Signed-off-by: Jan Kara j...@suse.cz

 ok, that was easy :) 

Reviewed-by: Mark Fasheh mfas...@suse.de
--Mark

--
Mark Fasheh

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: Move dquot_initialize() in ocfs2_delete_inode() somewhat later

2014-02-24 Thread Mark Fasheh
On Fri, Feb 21, 2014 at 10:45:00AM +0100, Jan Kara wrote:
 Move the dquot_initialize() call in ocfs2_delete_inode() after the moment we
 verify the inode is actually a sane one to delete. We certainly don't want
 to initialize quota for system inodes etc. This also avoids calling into
 quota code from the downconvert thread.
 
 Add more details to the comment explaining why bailing out from
 ocfs2_delete_inode() when we are in the downconvert thread is OK.
 
 Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com
 Signed-off-by: Jan Kara j...@suse.cz
 ---
  fs/ocfs2/inode.c | 16 +++++++++-------
  1 file changed, 9 insertions(+), 7 deletions(-)
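
A minimal sketch of the reordering described above, with userspace stubs
standing in for the real helpers (all toy_* names are hypothetical, not the
ocfs2 functions): the cheap "is this inode sane to delete?" check runs first
and bails out before any quota work, so system inodes and the downconvert
thread never pull in the quota machinery.

#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-ins; real ocfs2 uses struct inode and its own helpers. */
struct toy_inode {
	bool is_system;
	bool in_downconvert_thread;
};

static bool toy_inode_is_valid_to_delete(const struct toy_inode *inode)
{
	/* System inodes and the downconvert thread must bail out here,
	 * before any quota work is done. */
	return !inode->is_system && !inode->in_downconvert_thread;
}

static void toy_dquot_initialize(struct toy_inode *inode)
{
	(void)inode;
	printf("quota initialized\n");
}

static void toy_delete_inode(struct toy_inode *inode)
{
	if (!toy_inode_is_valid_to_delete(inode))
		return;			/* bail before touching quota */
	toy_dquot_initialize(inode);	/* moved after the sanity checks */
	printf("inode wiped\n");
}

int main(void)
{
	struct toy_inode system_inode = { .is_system = true };
	struct toy_inode regular_inode = { 0 };

	toy_delete_inode(&system_inode);	/* no quota init at all */
	toy_delete_inode(&regular_inode);	/* quota init, then wipe */
	return 0;
}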

Reviewed-by: Mark Fasheh mfas...@suse.de

--
Mark Fasheh

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Junxiao Bi
Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
 Junxiao, thanks for looking into this issue. Please see my comment below

 On 02/24/2014 01:07 AM, Junxiao Bi wrote:
 Hi,

 On 07/19/2012 09:59 AM, Sunil Mushran wrote:
 Different issues.

 On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:

 On 07/19/2012 12:36 AM, Sunil Mushran wrote:
 This bug was detected during code audit. Never seen a crash. If it does
 hit, then we have bigger problems. So no point posting to stable.

 I have read a lot of the dlm recovery code recently, and I found this bug
 could happen in the following scenario.

 node 1:                                  migrate target node x:
 dlm_unregister_domain()
  dlm_migrate_all_locks()
   dlm_empty_lockres()
    select node x as migrate target node
    since there is a node x lock on the granted list.
    dlm_migrate_lockres()
     dlm_mark_lockres_migrating() {
      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
      node x unlock may happen here, res->granted list can be empty.
 If the unlock request got sent at this point, and if the request was
 *processed*, the lock must have been removed from the granted_list. If the
 request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set
 in dlm_lockres_release_ast() would make the dlm_unlock handler return
 DLM_MIGRATING to the caller (in this case node x). So I don't see how the
 granted_list could have a stale lock. Am I missing something?
I agree the granted_list will not have a stale lock. The issue is triggered
when there are no locks in the granted_list. On the migrate target node, the
granted_list is also empty after the unlock. Then, due to the wrong use of
list_for_each_entry in the following code, lock will not be NULL even though
the granted_list is empty. The lock pointer is invalid, so lock->ml.node !=
ml->node may be true and trigger the BUG.


			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
				tmpq = dlm_list_idx_to_ptr(res, j);
				list_for_each_entry(lock, tmpq, list) {
					if (lock->ml.cookie != ml->cookie)
						lock = NULL;
					else
						break;
				}
				if (lock)
					break;
			}

			/* lock is always created locally first, and
			 * destroyed locally last.  it must be on the list */
			if (!lock) {
				c = ml->cookie;
				BUG();
			}

			if (lock->ml.node != ml->node) {
				c = lock->ml.cookie;
				c = ml->cookie;
				BUG();
			}
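
To make the pitfall concrete, here is a small standalone userspace program (a
rough re-creation of the kernel list macros with made-up names; it is not
<linux/list.h> or ocfs2 code). On an empty list the loop body never runs, so
the cursor is never reset to NULL; it ends up pointing just past the list
head, and dereferencing it is exactly the crash described above. Tracking the
match in a separate found flag, as the patch does, avoids trusting the cursor
after the loop.

#include <stdio.h>
#include <stddef.h>

/* Minimal userspace re-creation of the kernel's circular list macros, just
 * to demonstrate the pitfall; fake_lock is a made-up structure. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name)	{ &(name), &(name) }

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member)	container_of(ptr, type, member)

/* Same shape as the kernel macro: when the loop finishes without a break,
 * the cursor is NOT NULL -- it points at the list head reinterpreted as a
 * (bogus) entry. */
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = list_entry(pos->member.next, __typeof__(*pos), member))

struct fake_lock {
	unsigned long long cookie;
	struct list_head list;
};

int main(void)
{
	struct list_head granted = LIST_HEAD_INIT(granted);	/* empty list */
	struct fake_lock *lock = NULL;
	int found = 0;

	/* Buggy pattern: rely on "lock == NULL means not found". */
	list_for_each_entry(lock, &granted, list) {
		if (lock->cookie != 42)
			lock = NULL;	/* also clobbers the loop cursor */
		else
			break;
	}
	/* On an empty list the body never runs, so lock is never set to NULL;
	 * it points just past the list head, and dereferencing it (as the
	 * lock->ml.node check does) is the crash. */
	printf("buggy: lock = %p (non-NULL, but invalid)\n", (void *)lock);

	/* Fixed pattern: track the match in a separate flag. */
	lock = NULL;
	list_for_each_entry(lock, &granted, list) {
		if (lock->cookie == 42) {
			found = 1;
			break;
		}
	}
	printf("fixed: found = %d (only trust lock when found is set)\n", found);
	return 0;
}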

Thanks,
Junxiao.

 I do think the race you point out below exists, but I am not sure it is
 due to the race described above.

      dlm_lockres_release_ast(dlm, res);
     }
     dlm_send_one_lockres()
                                          dlm_process_recovery_data() {
                                           tmpq is res->granted list and is
                                           empty.
                                           list_for_each_entry(lock, tmpq, list) {
                                            if (lock->ml.cookie != ml->cookie)
                                             lock = NULL;
                                            else
                                             break;
                                           }
                                           lock will be invalid here.
                                           if (lock->ml.node != ml->node)
                                            BUG() <-- crash here.
                                          }

 Thanks,
 Junxiao.

 Our customer can reproduce it. Also I saw you were assigned a
 similar bug before, see
 https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
 same BUG?

 On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

 Hi Sunil,

 On 07/18/2012 03:49 AM, Sunil Mushran wrote:
 On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find the local
 dlm lock is wrong: it shouldn't change the loop variable lock in the
 list_for_each_entry loop. This will cause a NULL-pointer access crash.
 

Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Junxiao Bi
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
 Junxiao, thanks for looking into this issue. Please see my comment below

 On 02/24/2014 01:07 AM, Junxiao Bi wrote:
 Hi,

 On 07/19/2012 09:59 AM, Sunil Mushran wrote:
 Different issues.

 On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com wrote:

 On 07/19/2012 12:36 AM, Sunil Mushran wrote:
 This bug was detected during code audit. Never seen a crash. If it does
 hit, then we have bigger problems. So no point posting to stable.

 I have read a lot of the dlm recovery code recently, and I found this bug
 could happen in the following scenario.

 node 1:                                  migrate target node x:
 dlm_unregister_domain()
  dlm_migrate_all_locks()
   dlm_empty_lockres()
    select node x as migrate target node
    since there is a node x lock on the granted list.
    dlm_migrate_lockres()
     dlm_mark_lockres_migrating() {
      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
      node x unlock may happen here, res->granted list can be empty.
 If the unlock request got sent at this point, and if the request was
 *processed*, the lock must have been removed from the granted_list. If the
 request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set
 in dlm_lockres_release_ast() would make the dlm_unlock handler return
 DLM_MIGRATING to the caller (in this case node x). So I don't see how the
 granted_list could have a stale lock. Am I missing something?

 I do think the race you point out below exists, but I am not sure it is
 due to the race described above.
Outside the window from setting the RES_BLOCK_DIRTY flag and the wait_event()
to dlm_lockres_release_ast(), the granted_list cannot be empty: before the
window, wait_event() waits until dlm_thread clears the dirty flag, and the
list shuffle there moves another lock onto the granted list; after the window,
the DLM_MIGRATING flag stops other nodes' unlocks from emptying the granted
list. So I think an unlock inside this window is what causes the empty granted
list and the migrate target panic. I didn't see any other harm from this,
since the migrate target node will shuffle the list and send the ast message
later.

Thanks,
Junxiao.

      dlm_lockres_release_ast(dlm, res);
     }
     dlm_send_one_lockres()
                                          dlm_process_recovery_data() {
                                           tmpq is res->granted list and is
                                           empty.
                                           list_for_each_entry(lock, tmpq, list) {
                                            if (lock->ml.cookie != ml->cookie)
                                             lock = NULL;
                                            else
                                             break;
                                           }
                                           lock will be invalid here.
                                           if (lock->ml.node != ml->node)
                                            BUG() <-- crash here.
                                          }

 Thanks,
 Junxiao.

 Our customer can reproduce it. Also I saw you were assigned a
 similar bug before, see
 https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
 same BUG?

 On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com wrote:

 Hi Sunil,

 On 07/18/2012 03:49 AM, Sunil Mushran wrote:
 On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com wrote:

 In the target node of the dlm lock migration, the logic to find the local
 dlm lock is wrong: it shouldn't change the loop variable lock in the
 list_for_each_entry loop. This will cause a NULL-pointer access crash.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Cc: sta...@vger.kernel.org
 ---
  fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
  1 file changed, 7 insertions(+), 5 deletions(-)
 
 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 01ebfd0..0b9cc88 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
  	u8 from = O2NM_MAX_NODES;
  	unsigned int added = 0;
  	__be64 c;
 +	int found;
 
  	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
  	for (i=0; i<mres->num_locks; i++) {
 @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,