Hi Hoang,

ACK, you can push.

>> So I will continue checking it in a separate ticket.

Please create a ticket for tracking.

-AVM


On 4/14/2017 2:14 PM, Vo Minh Hoang wrote:
> Dear Mahesh,
>
> Thank you for your comments.
> I have added two of my ideas inline; please look for the [Hoang] tags.
>
> Dear Zoran,
>
> Do you have any additional comments on this patch?
> If not, I will request to push it at the start of next week.
>
> Sincerely,
> Hoang
>
> -----Original Message-----
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: Thursday, April 13, 2017 5:47 PM
> To: Vo Minh Hoang <hoang.m...@dektech.com.au>; zoran.milinko...@ericsson.com
> Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
> <ramesh.bet...@oracle.com>
> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
> [#1765] V5
>
> Hi Hoang,
>
> ACK with the following comments (tested basic ND restarts):
>
> - The errors below are not related to this patch; they are test-case related.
>
> - There looks to be an existing issue (not related to this patch): on a CPND
> down, the STANDBY CPD is also starting
> `cpd_tmr_start(&node_info->cpnd_ret_timer,..);`. Please check that flow once
> (after a CPND restart, wait a while on the active CPD and then do a
> switchover).
> [Hoang]: I can also reproduce this behavior but could not find the error.
> So I will continue checking it in a separate ticket.
> It is a little weird that the standby CPD triggers anything; honestly, I
> think the standby should do data sync only. But it is too soon to say much
> about this case.
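>
> [Hoang]: As a sketch of what I would expect the standby to do instead (this
> assumes the CPD control block exposes the AMF HA state, e.g. as
> cb->ha_state, and that the timer period argument is
> CPD_CPND_DOWN_RETENTION_TIME; both are assumptions to verify against the
> real cpd code):
>
>     /* Only the active CPD should drive the retention timer; the
>      * standby should limit itself to data sync. */
>     if (cb->ha_state == SA_AMF_HA_ACTIVE)
>         cpd_tmr_start(&node_info->cpnd_ret_timer,
>                       CPD_CPND_DOWN_RETENTION_TIME);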
>
> - You introduced `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);` in
> cpnd_down_process(), but cpnd_up_process() already calls
> `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`.
> Please check whether that call is redundant.
> [Hoang]: I think we should keep this call even if it is redundant in this
> case. We keep finding unexpected error cases in the system, so we cannot say
> for sure whether it is redundant or not.
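>
> [Hoang]: One more note: a redundant stop is harmless as long as the stop is
> idempotent. A minimal sketch of that pattern (a simplified stand-in, not the
> real CPD_TMR or cpd_tmr_stop() code):
>
>     #include <stdbool.h>
>
>     struct tmr {
>         bool is_active;
>     };
>
>     static void tmr_stop(struct tmr *t)
>     {
>         if (!t->is_active)
>             return;          /* already stopped: the call is a no-op */
>         /* ... cancel the underlying OS timer here ... */
>         t->is_active = false;
>     }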
>
> -AVM
>
> On 4/12/2017 2:19 PM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> If CPND is only temporarily down, we don't need to clean up anything.
>>> If CPND is permanently down, the drawback of this proposal is that the
>>> replica is not cleaned up. But if CPND is permanently down, we have to
>>> reboot the node to recover, so I think this cleanup is not really
>>> necessary.
>>>
>>> I also checked this implementation against the possible test cases and
>>> have not seen any side effects.
>>> Please consider it.
>> We are observing new node_user_info database mismatch errors while testing
>> multiple CPND restarts with this patch. I will do more debugging and report
>> the root cause.
>>
>> ===========================================================================
>>
>> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
>> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
>> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
>> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>>
>> ===========================================================================
>>
>>
>> -AVM
>>
>>
>> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>>> Hi Hoang,
>>>
>>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>>> Dear Mahesh,
>>>>
>>>> Based on what I saw, in this case the retention timer cannot detect that
>>>> CPND was only temporarily down, because its pid has changed.
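>>>>
>>>> A simplified model of what I mean (node_entry and stop_timer_on_up are
>>>> made-up illustrative names, not the real cpd structures):
>>>>
>>>>     #include <stdbool.h>
>>>>     #include <sys/types.h>
>>>>
>>>>     struct node_entry {
>>>>         pid_t pid;              /* pid of the CPND that went down */
>>>>         bool ret_timer_running; /* set when that CPND went down   */
>>>>     };
>>>>
>>>>     /* Stops the timer only if the rejoining CPND matches the entry. */
>>>>     static bool stop_timer_on_up(struct node_entry *e, pid_t up_pid)
>>>>     {
>>>>         if (e->pid != up_pid)
>>>>             return false;   /* restarted CPND has a new pid, so the
>>>>                              * stale retention timer keeps running */
>>>>         e->ret_timer_running = false;
>>>>         return true;
>>>>     }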
>>> I will check that; I have some test cases based on this retention time,
>>> and I am not sure how they were working.
>>>
>>> Can you please provide reproduction steps? I did look at the ticket, but
>>> it looks complex. If you have an application that reproduces the case,
>>> please share it.
>>>
>>> -AVM
>>>> If CPND is only temporarily down, we don't need to clean up anything.
>>>> If CPND is permanently down, the drawback of this proposal is that the
>>>> replica is not cleaned up. But if CPND is permanently down, we have to
>>>> reboot the node to recover, so I think this cleanup is not really
>>>> necessary.
>>>>
>>>> I also checked this implementation against the possible test cases and
>>>> have not seen any side effects.
>>>> Please consider it.
>>>>
>>>> Thank you and best regards,
>>>> Hoang
>>>>
>>>> -----Original Message-----
>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>> Sent: Friday, February 10, 2017 10:40 AM
>>>> To: Hoang Vo <hoang.m...@dektech.com.au>;
>>>> zoran.milinko...@ericsson.com
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of
>>>> cpsv [#1765] V5
>>>>
>>>> Hi Hoang,
>>>>
>>>> The CPD_CPND_DOWN_RETENTION timer is used to recognize whether a CPND is
>>>> temporarily or permanently down. It is started when a CPND goes down; on
>>>> cpd_evt_proc_timer_expiry(), CPD concludes that the CPND is completely
>>>> down and does the cleanup. If the CPND rejoins within
>>>> CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is
>>>> stopped.
>>>>
>>>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>>>> how does CPD recognize that the CPND is permanently down? Also,
>>>> cpd_process_cpnd_down() is called in multiple flows; can you please check
>>>> all of those flows to see whether stopping the CPD_CPND_DOWN_RETENTION
>>>> timer has any impact?
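>>>>
>>>> To make the intended lifecycle explicit, here is a small self-contained
>>>> model (cpnd_down/cpnd_up/timer_expiry are illustrative names, not the
>>>> actual cpd entry points):
>>>>
>>>>     #include <stdbool.h>
>>>>     #include <stdio.h>
>>>>
>>>>     struct cpnd_node {
>>>>         bool ret_timer_running; /* stands in for cpnd_ret_timer */
>>>>     };
>>>>
>>>>     /* CPND went down: start the retention timer. */
>>>>     static void cpnd_down(struct cpnd_node *n)
>>>>     {
>>>>         n->ret_timer_running = true;
>>>>         printf("CPND down: retention timer started\n");
>>>>     }
>>>>
>>>>     /* CPND rejoined within CPD_CPND_DOWN_RETENTION_TIME: stop it. */
>>>>     static void cpnd_up(struct cpnd_node *n)
>>>>     {
>>>>         if (n->ret_timer_running) {
>>>>             n->ret_timer_running = false;
>>>>             printf("CPND rejoined: retention timer stopped\n");
>>>>         }
>>>>     }
>>>>
>>>>     /* Expiry: the CPND is considered permanently down; clean up. */
>>>>     static void timer_expiry(struct cpnd_node *n)
>>>>     {
>>>>         n->ret_timer_running = false;
>>>>         printf("Retention expired: cleaning up CPND state\n");
>>>>     }
>>>>
>>>>     int main(void)
>>>>     {
>>>>         struct cpnd_node n = { false };
>>>>         cpnd_down(&n);    /* temporary outage...            */
>>>>         cpnd_up(&n);      /* ...rejoins in time: no cleanup */
>>>>         cpnd_down(&n);    /* permanent outage...            */
>>>>         timer_expiry(&n); /* ...expiry triggers the cleanup */
>>>>         return 0;
>>>>     }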
>>>>
>>>> -AVM
>>>>
>>>> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>>>>     src/ckpt/ckptd/cpd_proc.c |  11 ++++++++++-
>>>>>     1 files changed, 10 insertions(+), 1 deletions(-)
>>>>>
>>>>>
>>>>> Problem:
>>>>> When failover happens multiple times, CPND is down for a moment, so no
>>>>> CPND has the specific checkpoint open. This causes the retention timer
>>>>> to be triggered.
>>>>> When CPND comes up again it has a different pid, so the retention timer
>>>>> is not stopped.
>>>>> The replica is deleted when retention expires while its information is
>>>>> still in the ckpt database.
>>>>> That causes the problem.
>>>>>
>>>>> Fix:
>>>>> - Stop the timer of the removed node.
>>>>> - Update the data in the patricia trees (to keep the retention value
>>>>>   consistent).
>>>>>
>>>>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>>>>> --- a/src/ckpt/ckptd/cpd_proc.c
>>>>> +++ b/src/ckpt/ckptd/cpd_proc.c
>>>>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>>>>>      cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>>>>>      if (!cpnd_info)
>>>>>          return NCSCC_RC_SUCCESS;
>>>>> -
>>>>> +    /* Stop timer before processing down */
>>>>> +    cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>>>>>      cref_info = cpnd_info->ckpt_ref_list;
>>>>>
>>>>>      while (cref_info) {
>>>>> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>>>>>
>>>>>      /* Update the retention Time */
>>>>>      (*ckpt_node)->ret_time = reten_time;
>>>>> +    (*ckpt_node)->attributes.retentionDuration = reten_time;
>>>>> +
>>>>> +    /* Update the related patricia tree */
>>>>> +    CPD_CKPT_MAP_INFO *map_info = NULL;
>>>>> +    cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
>>>>> +    if (map_info) {
>>>>> +        map_info->attributes.retentionDuration = reten_time;
>>>>> +    }
>>>>>      return rc;
>>>>>  }
>

