I don't have any good ideas right now. The old check isn't reliable, but we can 
think about it for later.

It should be a rare case anyway, and 10s is probably not that noticeable when 
you consider start up times of other apps.

> On 30 Oct 2018, at 6:29 pm, Minh Hon Chau <minh.c...@dektech.com.au> wrote:
> 
> Hi Nagu,
> 
> It's fine with me too, or Gary knows how to exclude the wait of SC-2.
> 
> Thanks
> 
> Minh
> 
>> On 30/10/18 6:14 pm, Nagendra Kumar wrote:
>> Hi Minh,
>>          I thought, it would be rare. But if you find that it is breaking 
>> your existing functionality(backward compatibility) i.e. delaying the 
>> Cluster Startup i.e. earlier it used to take say 2 seconds, now it takes 10 
>> seconds with one controller, then Gary can retain the older code. It is fine 
>> with me.
>> 
>> Thanks
>> -Nagu
>> 
>> -----Original Message-----
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: 30 October 2018 12:40
>> To: Nagendra Kumar; 'Gary Lee'; hans.nordeb...@ericsson.com
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]
>> 
>> Hi Nagu,
>> 
>> The subsequent procedures will be delayed, including the assignments
>> too, to wait for SC2 to join, it is nearly 10 secs I guess. The headless
>> sync does not need to wait for standby amfd, so here the code was not
>> expecting to wait for SC-2. I think this scenario is rare?
>> 
>> Thanks
>> 
>> Minh
>> 
>> 
>>> On 30/10/18 4:43 pm, Nagendra Kumar wrote:
>>> Hi Minh,
>>>          I had noticed that point while review. But, if both SCs have gone 
>>> down, then expected is both should join.
>>> If only one SC starts, then yes timeout will happen. Do you see any major 
>>> implications than assignments delay, which I think should be fine because, 
>>> the expected delay is waiting for SC-2 to join?
>>> 
>>> Thanks
>>> -Nagu
>>> 
>>> -----Original Message-----
>>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>>> Sent: 30 October 2018 02:41
>>> To: Nagendra Kumar; 'Gary Lee'; hans.nordeb...@ericsson.com
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]
>>> 
>>> Hi Gary, Nagu
>>> 
>>> One notice you may know from the patch.
>>> 
>>> If we have two SCs cluster, go headless, only start SC1, now the
>>> headless sync will be always timeout to wait for SC2 up.
>>> 
>>> Thanks
>>> 
>>> Minh
>>> 
>>>> On 29/10/18 7:19 pm, Nagendra Kumar wrote:
>>>> Hi Gary,
>>>>             Great simplification!. Ack.
>>>> 
>>>> Thanks
>>>> -Nagu
>>>> 
>>>> -----Original Message-----
>>>> From: Gary Lee [mailto:gary....@dektech.com.au]
>>>> Sent: 29 October 2018 12:36
>>>> To: minh.c...@dektech.com.au; hans.nordeb...@ericsson.com; Nagendra Kumar
>>>> Cc: opensaf-devel@lists.sourceforge.net; Gary Lee
>>>> Subject: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]
>>>> 
>>>> If all nodes are synced after headless, the timer is stopped
>>>> but node_sync_window_closed is never set to true.
>>>> 
>>>> Later on, if a node becomes split from the main network and
>>>> rejoins, it will send a headless sync to amfd.
>>>> 
>>>> amfd will go into a never ending loop of processing the message,
>>>> putting back into the queue, etc.
>>>> 
>>>> When the node sync timer is stopped, ensure node_sync_window_closed
>>>> is set.
>>>> 
>>>> Also modify avd_count_node_up() not to count standby SC.
>>>> Sometimes a node_up from the standby SC arrives before mds up,
>>>> and the stadnby SC is incorrectly included in the node sync
>>>> count. Then a legitimate node_up from a PL is not accepted
>>>> because node_sync_window_closed is prematurely set.
>>>> ---
>>>>    src/amf/amfd/ndfsm.cc | 28 +++-------------------------
>>>>    1 file changed, 3 insertions(+), 25 deletions(-)
>>>> 
>>>> diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
>>>> index edc993988..375c5c7b1 100644
>>>> --- a/src/amf/amfd/ndfsm.cc
>>>> +++ b/src/amf/amfd/ndfsm.cc
>>>> @@ -165,34 +165,12 @@ done:
>>>>     *
>>>>    
>>>> **************************************************************************/
>>>>    uint32_t avd_count_sync_node_size(AVD_CL_CB *cb) {
>>>> -  uint32_t twon_ncs_su_count = 0;
>>>>      uint32_t count = 0;
>>>>      TRACE_ENTER();
>>>>    -  for (const auto &value : *node_name_db) {
>>>> -    AVD_AVND *avnd = value.second;
>>>> -    osafassert(avnd);
>>>> -    for (const auto &su : avnd->list_of_ncs_su) {
>>>> -      if (su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL)
>>>> {
>>>> -        twon_ncs_su_count++;
>>>> -        continue;
>>>> -      }
>>>> -    }
>>>> -  }
>>>> -  // cluster can have 1 SC or more SCs which hosting 2N Opensaf SU
>>>> -  // so twon_ncs_su_count at least is 1
>>>> -  osafassert(twon_ncs_su_count > 0);
>>>> -
>>>> -  if (twon_ncs_su_count == 1) {
>>>> -    // 1 SC, the rest of nodes could be in sync from headless
>>>> -    count = node_name_db->size() - 1;
>>>> -  } else {
>>>> -    // >=2 SCs, the rest of nodes could be in sync except active/standby 
>>>> SC
>>>> -    count = node_name_db->size() - 2;
>>>> -  }
>>>> +  count = node_name_db->size() - 1;
>>>>          TRACE("sync node size:%d", count);
>>>> -  TRACE_LEAVE();
>>>>      return count;
>>>>    }
>>>>    
>>>> /***************************************************************************
>>>> **
>>>> @@ -218,8 +196,7 @@ uint32_t avd_count_node_up(AVD_CL_CB *cb) {
>>>>      for (const auto &value : *node_name_db) {
>>>>        node = value.second;
>>>>        if (node->node_up_msg_count > 0 &&
>>>> -        node->node_info.nodeId != cb->node_id_avd &&
>>>> -        node->node_info.nodeId != cb->node_id_avd_other)
>>>> +        node->node_info.nodeId != cb->node_id_avd)
>>>>          ++received_count;
>>>>      }
>>>>      TRACE("Number of node director(s) that director received node_up 
>>>> msg:%u",
>>>> @@ -329,6 +306,7 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
>>>>          if (cb->node_sync_tmr.is_active) {
>>>>            avd_stop_tmr(cb, &cb->node_sync_tmr);
>>>>            TRACE("stop NodeSync timer");
>>>> +        cb->node_sync_window_closed = true;
>>>>          }
>>>>          cb->all_nodes_synced = true;
>>>>          LOG_NO("Received node_up_msg from all nodes");
>> 



_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to