Hi Minh,

One minor correction is still needed.

The node_up event arrives very early. Once at least one node_up event has
been received from every amfnd, AMFD stops the node sync timer, even before
the cluster startup timer has started:

    if (rc_node_up == sync_nd_size) {
        if (cb->node_sync_tmr.is_active) {
            avd_stop_tmr(cb, &cb->node_sync_tmr);
            TRACE("stop NodeSync timer");
        }
        cb->all_nodes_synced = true;
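
To make the early-stop rule concrete, here is a minimal standalone sketch.
It is not AMFD code: SyncModel, on_node_up() and node_sync_looks_finished()
are hypothetical names, and counting distinct node ids is only an assumption
about how rc_node_up is derived.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the few AVD_CL_CB fields the check touches.
struct SyncModel {
  bool node_sync_tmr_active = true;  // models cb->node_sync_tmr.is_active
  bool all_nodes_synced = false;     // models cb->all_nodes_synced
  std::set<uint32_t> seen;           // node ids that sent at least one node_up
};

// Records one node_up and applies the early-stop rule: once every expected
// node has been seen at least once, the node sync timer is stopped.
inline void on_node_up(SyncModel &m, uint32_t node_id,
                       std::size_t sync_nd_size) {
  m.seen.insert(node_id);  // repeated node_up from the same node counts once
  if (m.seen.size() == sync_nd_size) {
    m.node_sync_tmr_active = false;
    m.all_nodes_synced = true;
  }
}

// What a later handler can learn from the timer flag alone. It cannot tell
// "stopped early" apart from "really expired".
inline bool node_sync_looks_finished(const SyncModel &m) {
  return !m.node_sync_tmr_active;
}
```

In this model, as soon as the last expected node reports, the timer flag
reads inactive, which is the same value a handler would see after a real
expiry.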

But AMFD does not process these node_up events, because it is not yet in
the INIT state at this point:

    if ((n2d_msg->msg_info.n2d_node_up.node_id != cb->node_id_avd) &&
        (cb->init_state < AVD_INIT_DONE)) {
        TRACE("invalid init state (%u), node %x",
              cb->init_state, n2d_msg->msg_info.n2d_node_up.node_id);
        goto done;
    }

When AMFD moves to the INIT state it starts the cluster startup timer. If
that timer's value is very low, it will expire, see that the node sync
timer is not running, and send the set_leds messages. At that point some
nodes may still be syncing. So the amfnds that have already synced will
start sending assignment messages while other amfnds, which may host SUs,
are still syncing.

We need to bring two things to the same level:
1) When the set_leds message is sent to the amfnds, all amfnds should be in
   the PRESENT state. That means their SUs are enabled and amfnd can
   process assignments.
2) Any other amfnd that has not joined by this time will be rebooted.

Problem: how do we hold the amfnds back from sending buffered assignment
events until all of them are in the PRESENT state and the non-PRESENT ones
can never join the cluster? Neither the node sync timer nor the cluster
startup timer ensures that all amfnds are synced and that non-synced ones
will be rebooted.

One idea: what if we do not stop the node sync timer when
"if (rc_node_up == sync_nd_size)" is hit, and set node_sync_window_closed
to true only when the timer really expires in avd_node_sync_tmr_evh(), as
is done today? What are the implications of that?
AMFD will still have to make sure that nodes sending a fresh
node_up_event() are rebooted.
Then, if the cluster timer expires early, it will see the node sync timer
still running and will not send set_leds. But will all the payloads really
move to the PRESENT state within 10 seconds? I am relying on the chosen
value of 10 seconds.

Thanks,
Praveen

On 24-Aug-16 5:35 PM, praveen malviya wrote:
> Hi Minh,
>
> Any assignment message should be processed after cluster timer expiry
> and node sync timer expiry.
> The bug fix patch
> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff honors cluster timer
> expiry but not node sync timer.
> After node sync timer expiry, delayed payloads will be rebooted and if
> these payloads host any SU/SUSIs, they will be deleted. So admin op will
> finish gracefully.
> I think for loop can be added in both timers' expiry events with a check
> on expiry of other timer:
>
> diff --git a/osaf/services/saf/amf/amfd/cluster.cc b/osaf/services/saf/amf/amfd/cluster.cc
> --- a/osaf/services/saf/amf/amfd/cluster.cc
> +++ b/osaf/services/saf/amf/amfd/cluster.cc
> @@ -74,12 +74,13 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
>         m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(cb, cb, AVSV_CKPT_AVD_CB_CONFIG);
>
>         // Resend set_leds to veteran node
> -
> -       for (std::map<std::string, AVD_AVND *>::const_iterator it = node_name_db->begin();
> -                       it != node_name_db->end(); it++) {
> -               node = it->second;
> -               if (node->veteran)
> -                       avd_snd_set_leds_msg(cb, node);
> +       if (cb->node_sync_tmr.is_active == false) {
> +               for (std::map<std::string, AVD_AVND *>::const_iterator it = node_name_db->begin();
> +                               it != node_name_db->end(); it++) {
> +                       node = it->second;
> +                       if (node->veteran)
> +                               avd_snd_set_leds_msg(cb, node);
> +               }
>         }
>
>         /* call the realignment routine for each of the SGs in the
> @@ -143,6 +144,17 @@ void avd_node_sync_tmr_evh(AVD_CL_CB *cb
>         // Setting true here to indicate the node sync window has closed
>         // Further node up message will be treated specially
>         cb->node_sync_window_closed = true;
> +       // Resend set_leds to veteran node
> +       if (cb->amf_init_tmr.is_active == false) {
> +               AVD_AVND *node = nullptr;
> +               for (std::map<std::string, AVD_AVND *>::const_iterator it = node_name_db->begin();
> +                               it != node_name_db->end(); it++) {
> +                       node = it->second;
> +                       if (node->veteran)
> +                               avd_snd_set_leds_msg(cb, node);
> +               }
> +       }
> +
>
>         TRACE_LEAVE();
>  }
>
>
>
> Thanks,
> Praveen
> On 24-Aug-16 4:58 PM, Nagendra Kumar wrote:
>> The below is the assignments after
>> the test case (SU2 has standby
>> assignment):
>>
>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd status
>> safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>         saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>         saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>>
>> Thanks
>> -Nagu
>>
>>> -----Original Message-----
>>> From: Nagendra Kumar
>>> Sent: 24 August 2016 16:55
>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>> gary....@dektech.com.au; long.hb.ngu...@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation
>>> continuation if csi callback completes during headless
>>> [#1725 part 1] V1
>>>
>>> Hi Minh,
>>> With 1725_phase_1_V2.tgz, the below email TC has failed. Please
>>> find the traces attached along with the configuration in the ticket.
>>>
>>> Thanks
>
------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel