Hi Hoang,

ACK, you can push.

>> So I will continue checking it in a separate ticket.

Please create a ticket for tracking.

-AVM

On 4/14/2017 2:14 PM, Vo Minh Hoang wrote:
> Dear Mahesh,
>
> Thank you for your comments.
> I have added two of my ideas inline; please look for the [Hoang] tags.
>
> Dear Zoran,
>
> Do you have any extra comments about this patch?
> If not, I will request pushing it at the start of next week.
>
> Sincerely,
> Hoang
>
> -----Original Message-----
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: Thursday, April 13, 2017 5:47 PM
> To: Vo Minh Hoang <hoang.m...@dektech.com.au>; zoran.milinko...@ericsson.com
> Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham <ramesh.bet...@oracle.com>
> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
>
> Hi Hoang,
>
> ACK with the following (tested basic ND restarts):
>
> - The errors below are not related to this patch; they are test-case
>   related.
>
> - There seems to be an existing issue (not related to this patch): on CPND
>   down, the STANDBY CPD also starts
>   `cpd_tmr_start(&node_info->cpnd_ret_timer, ..);`. Please check that flow
>   once (after a cpnd restart, keep some sleep on the active CPD and do a
>   switchover).
> [Hoang]: I can also reproduce this behavior but could not find the error,
> so I will continue checking it in a separate ticket.
> It is a little bit strange that the standby CPD triggers anything;
> honestly, I think the standby should only do data sync. By the way, it is
> too soon to talk about this case.
>
> - You introduced cpd_tmr_stop(&cpnd_info->cpnd_ret_timer); in
>   cpnd_down_process(), but cpnd_up_process() already calls
>   `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`. Please check whether that
>   call may be redundant.
> [Hoang]: I think we should keep this call even if it is redundant in this
> case. We are finding more and more unexpected error cases in the system
> and cannot tell for sure whether it is redundant or not.
>
> -AVM
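(For reference: the redundancy discussed above is harmless as long as the stop call is guarded by the timer's state, which is the usual pattern for such timer helpers. A minimal standalone sketch of that guard; CPD_TMR and cpd_tmr_stop_sketch() here are simplified stand-ins, not the actual OpenSAF definitions:

    /* Sketch: a state-guarded timer stop makes a second call a no-op. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
            bool is_active; /* set while the retention timer is running */
    } CPD_TMR;

    static void cpd_tmr_stop_sketch(CPD_TMR *tmr)
    {
            if (!tmr->is_active)
                    return; /* redundant call: nothing to do */
            tmr->is_active = false;
            printf("retention timer stopped\n");
    }

    int main(void)
    {
            CPD_TMR ret_timer = { .is_active = true };
            cpd_tmr_stop_sketch(&ret_timer); /* e.g. from cpnd_down_process() */
            cpd_tmr_stop_sketch(&ret_timer); /* e.g. from cpnd_up_process(): no-op */
            return 0;
    }

If the real cpd_tmr_stop() follows this pattern, keeping both calls costs nothing; if it does not, the flows above are worth checking.)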
>
> On 4/12/2017 2:19 PM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> If cpnd is only temporarily down, we don't need to clean anything up.
>>> If cpnd is permanently down, the drawback of this proposal is that the
>>> replica is not cleaned up. But if cpnd is permanently down, we have to
>>> reboot the node to recover, so I think this cleanup is not really
>>> necessary.
>>>
>>> I also checked this implementation with the possible test cases and
>>> have not seen any side effect.
>>> Please consider it.
>> We are observing new node_user_info database mismatch errors while
>> testing multiple CPND restarts with this patch. I will do more debugging
>> and update the root cause.
>>
>> ===========================================================================
>>
>> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
>> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
>> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
>> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>>
>> ===========================================================================
>>
>> -AVM
>>
>> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>>> Hi Hoang,
>>>
>>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>>> Dear Mahesh,
>>>>
>>>> Based on what I saw, in this case the retention timer cannot detect
>>>> that CPND was only temporarily down, because its pid changed.
>>> I will check that; I have some test cases based on this retention time
>>> and am not sure how they were working.
>>>
>>> Can you please provide reproducible steps? I did look at the ticket,
>>> but it looks complex; if you have any application that reproduces the
>>> case, please share it.
>>>
>>> -AVM
>>>> If cpnd is only temporarily down, we don't need to clean anything up.
>>>> If cpnd is permanently down, the drawback of this proposal is that the
>>>> replica is not cleaned up. But if cpnd is permanently down, we have to
>>>> reboot the node to recover, so I think this cleanup is not really
>>>> necessary.
>>>>
>>>> I also checked this implementation with the possible test cases and
>>>> have not seen any side effect.
>>>> Please consider it.
>>>>
>>>> Thank you and best regards,
>>>> Hoang
>>>>
>>>> -----Original Message-----
>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>> Sent: Friday, February 10, 2017 10:40 AM
>>>> To: Hoang Vo <hoang.m...@dektech.com.au>; zoran.milinko...@ericsson.com
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
>>>>
>>>> Hi Hoang,
>>>>
>>>> The CPD_CPND_DOWN_RETENTION timer exists to recognize whether a CPND is
>>>> temporarily or permanently down. It is started when a CPND goes down;
>>>> via cpd_evt_proc_timer_expiry(), the CPD recognizes that the CPND is
>>>> completely down and performs the cleanup. Otherwise, if the CPND rejoins
>>>> within CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer
>>>> is stopped.
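(A condensed, self-contained sketch of that lifecycle as described in the paragraph above; on_cpnd_down(), on_cpnd_up(), and on_timer_expiry() are hypothetical helpers standing in for the real cpd_tmr_start()/cpd_tmr_stop()/cpd_evt_proc_timer_expiry() paths, not the actual OpenSAF code:

    /* Sketch: CPD_CPND_DOWN_RETENTION lifecycle. Timer starts on CPND
     * down; a rejoin in time stops it, expiry triggers the cleanup. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
            bool timer_running;
    } CPND_INFO;

    static void on_cpnd_down(CPND_INFO *n)
    {
            n->timer_running = true; /* cpd_tmr_start(&n->cpnd_ret_timer, ..) */
            printf("CPND down: retention timer started\n");
    }

    static void on_cpnd_up(CPND_INFO *n)
    {
            if (n->timer_running) {
                    n->timer_running = false; /* cpd_tmr_stop(..) */
                    printf("CPND rejoined in time: timer stopped, no cleanup\n");
            }
    }

    static void on_timer_expiry(CPND_INFO *n)
    {
            if (n->timer_running) { /* cpd_evt_proc_timer_expiry() path */
                    n->timer_running = false;
                    printf("CPND permanently down: cleaning up replicas\n");
            }
    }

    int main(void)
    {
            CPND_INFO node = { false };
            on_cpnd_down(&node);
            on_cpnd_up(&node);      /* temporary-down case: no cleanup */
            on_cpnd_down(&node);
            on_timer_expiry(&node); /* permanent-down case: cleanup runs */
            return 0;
    }

The bug under discussion is the temporary-down case where the rejoining CPND has a different pid, so the stop path is never matched and the expiry cleanup runs anyway.)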
>>>>
>>>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>>>> does the CPD still recognize that the CPND is permanently down? Since
>>>> cpd_process_cpnd_down() is called in multiple flows, can you please
>>>> check all of those flows for whether stopping the
>>>> CPD_CPND_DOWN_RETENTION timer has any impact?
>>>>
>>>> -AVM
>>>>
>>>> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>>>> src/ckpt/ckptd/cpd_proc.c | 11 ++++++++++-
>>>>> 1 files changed, 10 insertions(+), 1 deletions(-)
>>>>>
>>>>> Problem:
>>>>> When failover happens multiple times, cpnd is down for a moment, so no
>>>>> cpnd has the specific checkpoint open and the retention timer is
>>>>> triggered. When cpnd comes up again it has a different pid, so the
>>>>> retention timer is not stopped. The replica is then deleted at
>>>>> retention while its information is still in the ckpt database. That
>>>>> causes the problem.
>>>>>
>>>>> Fix:
>>>>> - Stop the timer of the removed node.
>>>>> - Update data in the patricia trees (for retention value consistency).
>>>>>
>>>>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>>>>> --- a/src/ckpt/ckptd/cpd_proc.c
>>>>> +++ b/src/ckpt/ckptd/cpd_proc.c
>>>>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>>>>>  	cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>>>>>  	if (!cpnd_info)
>>>>>  		return NCSCC_RC_SUCCESS;
>>>>> -
>>>>> +	/* Stop timer before processing down */
>>>>> +	cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>>>>>  	cref_info = cpnd_info->ckpt_ref_list;
>>>>>  	while (cref_info) {
>>>>> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>>>>>  	/* Update the retention Time */
>>>>>  	(*ckpt_node)->ret_time = reten_time;
>>>>> +	(*ckpt_node)->attributes.retentionDuration = reten_time;
>>>>> +
>>>>> +	/* Update the related patricia tree */
>>>>> +	CPD_CKPT_MAP_INFO *map_info = NULL;
>>>>> +	cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
>>>>> +	if (map_info) {
>>>>> +		map_info->attributes.retentionDuration = reten_time;
>>>>> +	}
>>>>>  	return rc;
>>>>>  }
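(To summarize the intent of the second hunk: whenever the retention duration changes, it must be written to both trees, or a later lookup through the map tree sees a stale value. A simplified standalone sketch with hypothetical flat types; the real CPD checkpoint node and CPD_CKPT_MAP_INFO carry much more state and live in patricia trees:

    /* Sketch: keep the retention duration consistent in both places. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t retentionDuration; } ATTRS;
    typedef struct { uint64_t ret_time; ATTRS attributes; } CKPT_NODE;
    typedef struct { ATTRS attributes; } MAP_INFO;

    static void retention_set_sketch(CKPT_NODE *ckpt_node, MAP_INFO *map_info,
                                     uint64_t reten_time)
    {
            ckpt_node->ret_time = reten_time;
            ckpt_node->attributes.retentionDuration = reten_time;
            if (map_info) /* map entry may be absent; mirror it when found */
                    map_info->attributes.retentionDuration = reten_time;
    }

    int main(void)
    {
            CKPT_NODE node = { 0 };
            MAP_INFO map = { { 0 } };
            retention_set_sketch(&node, &map, 5000000000ULL); /* e.g. 5 s in ns */
            printf("node=%llu map=%llu\n",
                   (unsigned long long)node.attributes.retentionDuration,
                   (unsigned long long)map.attributes.retentionDuration);
            return 0;
    }

The null check mirrors the patch's `if (map_info)` guard, so a missing map entry cannot crash the update path.)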