Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix a race between purge and master query
We tested this patch and it works well. Thanks. Tested-by: Joseph Qi joseph...@huawei.com On 2014/10/29 6:24, Srinivas Eeda wrote: Node A sends master query request to node B which is the master. At this time lockres happens to be on purgelist. dlm_master_request_handler gets the dlm spinlock, finds the resource and releases the dlm spin lock. Right at this dlm_thread on this node could purge the lockres. dlm_master_request_handler can then acquire lockres spinlock and reply to Node A that node B is the master even though lockres on node B is purged. The above scenario will now make node A falsely think node B is the master which is inconsistent. Further if another node C tries to master the same resource, every node will respond they are not the master. Node C then masters the resource and sends assert master to all nodes. This will now make node A crash with the following message. dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current owner is 10! Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c | 12 1 file changed, 12 insertions(+) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 215e41a..3689b35 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -1460,6 +1460,18 @@ way_up_top: /* take care of the easy cases up front */ spin_lock(res-spinlock); + + /* + * Right after dlm spinlock was released, dlm_thread could have + * purged the lockres. Check if lockres got unhashed. If so + * start over. + */ + if (hlist_unhashed(res-hash_node)) { + spin_unlock(res-spinlock); + dlm_lockres_put(res); + goto way_up_top; + } + if (res-state (DLM_LOCK_RES_RECOVERING| DLM_LOCK_RES_MIGRATING)) { spin_unlock(res-spinlock); ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); + o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } -- 1.7.9.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); + o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Hi Srini, On 10/30/2014 01:16 PM, Srinivas Eeda wrote: Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. This happened when o2net was issuing the first connect request to some node, but the request packet is lost due to some error like network broken, then the connect would be expired, in o2net_connect_expired() that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so o2net_start_connect() would return without sending another connect request, connection to the node will never be built. Thanks, Junxiao. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); +o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Hi Junxiao, thanks for explaining. For this case allowing a reconnect (setting atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should work ? Thanks, --Srini On 10/29/2014 10:32 PM, Junxiao Bi wrote: Hi Srini, On 10/30/2014 01:16 PM, Srinivas Eeda wrote: Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. This happened when o2net was issuing the first connect request to some node, but the request packet is lost due to some error like network broken, then the connect would be expired, in o2net_connect_expired() that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so o2net_start_connect() would return without sending another connect request, connection to the node will never be built. Thanks, Junxiao. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); +o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel