Thanks Srini,

This sounds good. I tried to use dlm_is_node_dead in this patch, but that function can't report that another node is dead if this node is already in the recovery process. The update of the bit in domain_map is blocked during recovery, whereas live_nodes_map always reflects the nodes that are really alive.
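
For illustration only, here is a rough user-space model of the suggested dlm_is_node_in_livemap check. It is not the kernel code: every model_* name, the MODEL_MAX_NODES bound and the plain-array bitmap are assumptions made up for this sketch, and a real helper would presumably take dlm->spinlock around test_bit() the way the patch does.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bound for this model only; not taken from the kernel headers. */
#define MODEL_MAX_NODES 255
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS       ((MODEL_MAX_NODES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Simplified stand-ins for the DLM_RECO_NODE_DATA_* states. */
enum model_node_state { MODEL_REQUESTED, MODEL_RECEIVING, MODEL_DONE, MODEL_DEAD };

/* Hypothetical stand-in for struct dlm_ctxt: only the live map is modeled. */
struct model_dlm {
    unsigned long live_nodes_map[MAP_WORDS];
};

static bool model_test_bit(unsigned int nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

static void model_set_bit(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static void model_clear_bit(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/* Model of the suggested helper. In the kernel the bit test would be
 * done under dlm->spinlock; locking is left out of this sketch. */
static bool model_is_node_in_livemap(const struct model_dlm *dlm, uint8_t node)
{
    assert(node < MODEL_MAX_NODES);
    return model_test_bit(node, dlm->live_nodes_map);
}

/* One pass of the recovery master's check for a single node.
 * Returns 1 if the master must keep waiting (all_nodes_done = 0),
 * 0 otherwise. A node that left the live map while REQUESTED or
 * RECEIVING is marked dead instead of being waited on forever. */
static int model_check_node(const struct model_dlm *dlm, uint8_t node,
                            enum model_node_state *state)
{
    if (*state != MODEL_REQUESTED && *state != MODEL_RECEIVING)
        return 0;
    if (!model_is_node_in_livemap(dlm, node)) {
        *state = MODEL_DEAD;
        return 0;
    }
    return 1;
}
```

With node 3 set in the map, model_check_node keeps the master waiting. Once the bit is cleared, the next pass moves the node to the dead state and no longer blocks on it.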
I will reformat the patch.

Thanks,
Xiaowei

On 05/26/2012 06:17 AM, srinivas eeda wrote:
> comments inline
>
> On 5/24/2012 10:53 PM, xiaowei...@oracle.com wrote:
>> From: "Xiaowei.Hu"<xiaowei...@oracle.com>
>>
>> when the master requested locks ,but one/some of the live nodes died,
>> after it received the request msg and before send out the locks packages,
>> the recovery will fall into endless loop,waiting for the status
>> changed to finalize
>>
>> NodeA                                   NodeB
>> selected as recovery master
>> dlm_remaster_locks
>>   -> dlm_requeset_all_locks
>>   this send request locks msg to B
>>                                         received the msg from A,
>>                                         queue worker
>>                                         dlm_request_all_locks_worker
>>                                         return 0
>> go on set state to requested
>> wait for the state become done
>>                                         NodeB lost connection due
>>                                         to network
>>                                         before the worker begin,
>>                                         or it die.
>> NodeA still waiting for the
>> change of reco state.
>> It won't end if it not get data done msg
>> And at this time nodeB do not realize this (or it just died),
>> it won't send the msg for ever, nodeA left in the recovery process
>> forever.
>>
>> This patch let the recovery master check if the node still in live node
>> map when it stay in REQUESTED status.
>>
>> Signed-off-by: Xiaowei.Hu<xiaowei...@oracle.com>
>> ---
>>  fs/ocfs2/dlm/dlmrecovery.c |    9 +++++++++
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>> index 01ebfd0..62659e8 100644
>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -555,6 +555,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>>      int all_nodes_done;
>>      int destroy = 0;
>>      int pass = 0;
>> +    int dying = 0;
>>
>>      do {
>>          /* we have become recovery master. there is no escaping
>> @@ -659,6 +660,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>>          list_for_each_entry(ndata,&dlm->reco.node_data, list) {
>>              mlog(0, "checking recovery state of node %u\n",
>>                   ndata->node_num);
>> +            dying = 0;
>>              switch (ndata->state) {
>>              case DLM_RECO_NODE_DATA_INIT:
>>              case DLM_RECO_NODE_DATA_REQUESTING:
>> @@ -679,6 +681,13 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>>                       dlm->name, ndata->node_num,
>>                       ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
>>                       "receiving" : "requested");
>> +                spin_lock(&dlm->spinlock);
>> +                dying = !test_bit(ndata->node_num, dlm->live_nodes_map);
>> +                spin_unlock(&dlm->spinlock);
>> +                if (dying) {
>> +                    ndata->state = DLM_RECO_NODE_DATA_DEAD;
>> +                    break;
>> +                }
>>                  all_nodes_done = 0;
>>                  break;
>>              case DLM_RECO_NODE_DATA_DONE:
> fix seems to address the issue, but can you please add a function
> dlm_is_node_in_livemap similar to dlm_is_node_dead so that it'
> improves readability. You can then add the following to check if the
> node is still alive
> + if (!dlm_is_node_in_livemap(dlm, ndata->node_num))
> +     ndate->state = DLM_RECO_NODE_DATA_DEAD;
> + else
> +     all_nodes_done = 0;

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel