Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix a race between purge and master query

2014-10-29 Thread Joseph Qi
We tested this patch and it works well.
Thanks.

Tested-by: Joseph Qi joseph...@huawei.com

On 2014/10/29 6:24, Srinivas Eeda wrote:
 Node A sends master query request to node B which is the master. At this time
 lockres happens to be on purgelist. dlm_master_request_handler gets the dlm
 spinlock, finds the resource and releases the dlm spin lock. Right at this
 dlm_thread on this node could purge the lockres. dlm_master_request_handler
 can then acquire lockres spinlock and reply to Node A that node B is the
 master even though lockres on node B is purged.
 
 The above scenario will now make node A falsely think node B is the master
 which is inconsistent. Further if another node C tries to master the same
 resource, every node will respond they are not the master. Node C then masters
 the resource and sends assert master to all nodes. This will now make node A
 crash with the following message.
 
 dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
 owner is 10!
 
 Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
  fs/ocfs2/dlm/dlmmaster.c | 12 
  1 file changed, 12 insertions(+)
 
 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 215e41a..3689b35 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -1460,6 +1460,18 @@ way_up_top:
  
   /* take care of the easy cases up front */
   spin_lock(res-spinlock);
 +
 + /*
 +  * Right after dlm spinlock was released, dlm_thread could have
 +  * purged the lockres. Check if lockres got unhashed. If so
 +  * start over.
 +  */
 + if (hlist_unhashed(res-hash_node)) {
 + spin_unlock(res-spinlock);
 + dlm_lockres_put(res);
 + goto way_up_top;
 + }
 +
   if (res-state  (DLM_LOCK_RES_RECOVERING|
 DLM_LOCK_RES_MIGRATING)) {
   spin_unlock(res-spinlock);
 



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Junxiao Bi
Set nn_persistent_error to -ENOTCONN will stop reconnect since the
stop condition in o2net_start_connect() will be true.

stop = (nn-nn_sc ||
(nn-nn_persistent_error 
(nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

This will make connection never be established if the first connection request
is lost.

Signed-off-by: Junxiao Bi junxiao...@oracle.com
---
 fs/ocfs2/cluster/tcp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 97de0fb..4d6b645 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct 
*work)
 o2net_idle_timeout() / 1000,
 o2net_idle_timeout() % 1000);
 
-   o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
+   o2net_set_nn_state(nn, NULL, 0, 0);
}
spin_unlock(nn-nn_lock);
 }
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Srinivas Eeda
Junxiao,

can you please describe under what circumstances you saw this problem? 
My understanding is o2net_connect_expired is only queued when connection 
actually broke and ENOTCONN is the right error in that case.

Thanks,
--Srini

On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
   (nn-nn_persistent_error 
   (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
   fs/ocfs2/cluster/tcp.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct 
 *work)
o2net_idle_timeout() / 1000,
o2net_idle_timeout() % 1000);
   
 - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 + o2net_set_nn_state(nn, NULL, 0, 0);
   }
   spin_unlock(nn-nn_lock);
   }


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Junxiao Bi
Hi Srini,

On 10/30/2014 01:16 PM, Srinivas Eeda wrote:
 Junxiao,
 
 can you please describe under what circumstances you saw this problem?
 My understanding is o2net_connect_expired is only queued when connection
 actually broke and ENOTCONN is the right error in that case.
This happened when o2net was issuing the first connect request to some
node, but the request packet is lost due to some error like network
broken, then the connect would be expired, in o2net_connect_expired()
that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so
o2net_start_connect() would return without sending another connect
request, connection to the node will never be built.

Thanks,
Junxiao.

 
 Thanks,
 --Srini
 
 On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
 (nn-nn_persistent_error 
 (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection
 request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
   fs/ocfs2/cluster/tcp.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct
 work_struct *work)
o2net_idle_timeout() / 1000,
o2net_idle_timeout() % 1000);
   -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 +o2net_set_nn_state(nn, NULL, 0, 0);
   }
   spin_unlock(nn-nn_lock);
   }
 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Srinivas Eeda
Hi Junxiao,

thanks for explaining. For this case allowing a reconnect (setting 
atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should work ?

Thanks,
--Srini


On 10/29/2014 10:32 PM, Junxiao Bi wrote:
 Hi Srini,

 On 10/30/2014 01:16 PM, Srinivas Eeda wrote:
 Junxiao,

 can you please describe under what circumstances you saw this problem?
 My understanding is o2net_connect_expired is only queued when connection
 actually broke and ENOTCONN is the right error in that case.
 This happened when o2net was issuing the first connect request to some
 node, but the request packet is lost due to some error like network
 broken, then the connect would be expired, in o2net_connect_expired()
 that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so
 o2net_start_connect() would return without sending another connect
 request, connection to the node will never be built.

 Thanks,
 Junxiao.

 Thanks,
 --Srini

 On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
  (nn-nn_persistent_error 
  (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection
 request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
fs/ocfs2/cluster/tcp.c |2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct
 work_struct *work)
 o2net_idle_timeout() / 1000,
 o2net_idle_timeout() % 1000);
-o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 +o2net_set_nn_state(nn, NULL, 0, 0);
}
spin_unlock(nn-nn_lock);
}


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel