Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Vo Minh Hoang Wed, 12 Apr 2017 02:22:33 -0700

Dear Mahesh,

Sorry for the delay; it took some time to recall this long-lost information.


Below are the steps to reproduce with the newest source code:
- create some non-collocated checkpoints on SC-1
- make a failover occur with pkill -9 amfd
- do that again 4 times on the active SC
- check /run/shm and find that all replicas are gone
- create a checkpoint with the same name again and get SA_AIS_ERR_LIBRARY

Sincerely,
Hoang

-----Original Message-----
From: A V Mahesh [mailto:[email protected]] 
Sent: Wednesday, April 12, 2017 3:50 PM
To: Vo Minh Hoang <[email protected]>; [email protected]
Cc: [email protected]; Ramesh Babu Betham
<[email protected]>
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
[#1765] V5

Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> If cpnd is only temporarily down, we don't need to clean up anything.
> If cpnd is permanently down, the bad effect of this proposal is that
> the replica is not cleaned up. But if cpnd is permanently down, we have
> to reboot the node to recover, so I think this cleanup is not really
> necessary.
>
> I also checked this implementation with possible test cases and have
> not seen any side effects.
> Please consider it.
We are observing new node_user_info database mismatch errors while
testing multiple CPND restarts with this patch. I will do more debugging
and update you on the root cause.

============================================================================

Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0

============================================================================

-AVM


On 4/12/2017 11:08 AM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> Dear Mahesh,
>>
>> Based on what I saw, in this case, retention time cannot detect that CPND
>> is only temporarily down because its pid changed.
> I will check that. I have some test cases based on this retention time,
> and I am not sure how they were working.
>
> Can you please provide reproducible steps? I did look at the ticket, but
> it looks complex. If you have any application that reproduces the case,
> please share it.
>
> -AVM
>>
>> If cpnd is only temporarily down, we don't need to clean up anything.
>> If cpnd is permanently down, the bad effect of this proposal is that
>> the replica is not cleaned up. But if cpnd is permanently down, we have
>> to reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I also checked this implementation with possible test cases and have
>> not seen any side effects.
>> Please consider it.
>>
>> Thank you and best regards,
>> Hoang
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 10, 2017 10:40 AM
>> To: Hoang Vo <[email protected]>; 
>> [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv 
>> [#1765] V5
>>
>> Hi Hoang,
>>
>> The CPD_CPND_DOWN_RETENTION timer is there to recognize whether a CPND
>> is temporarily or permanently down. It is started when a CPND goes down.
>> If it expires, then based on cpd_evt_proc_timer_expiry() cpd recognizes
>> that the CPND is completely down and does the cleanup; otherwise, if the
>> cpnd rejoins within CPD_CPND_DOWN_RETENTION_TIME, the
>> CPD_CPND_DOWN_RETENTION timer is stopped.
>>
>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>> does cpd still recognize the CPND as permanently down?
>> cpd_process_cpnd_down() is called in multiple flows; can you please
>> check all the flows to see whether stopping the CPD_CPND_DOWN_RETENTION
>> timer has any impact?
>>
>> -AVM
>>
>> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>>    src/ckpt/ckptd/cpd_proc.c |  11 ++++++++++-
>>>    1 files changed, 10 insertions(+), 1 deletions(-)
>>>
>>>
>>> Problem:
>>> In case of multiple failovers, the cpnd is down for a moment, so
>>> there is no cpnd opening the specific checkpoint. This leads to the
>>> retention timer being triggered.
>>> When cpnd is up again, it has a different pid, so the retention timer
>>> is not stopped.
>>> The replica is deleted at retention expiry while its information is
>>> still in the ckpt database.
>>> That causes the problem.
>>>
>>> Fix:
>>> - Stop the timer of the removed node.
>>> - Update data in patricia trees (for retention value consistency).
>>>
>>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>>> --- a/src/ckpt/ckptd/cpd_proc.c
>>> +++ b/src/ckpt/ckptd/cpd_proc.c
>>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>>>        cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>>>        if (!cpnd_info)
>>>            return NCSCC_RC_SUCCESS;
>>> -
>>> +    /* Stop timer before processing down */
>>> +    cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>>>        cref_info = cpnd_info->ckpt_ref_list;
>>>           while (cref_info) {
>>> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>>>           /* Update the retention Time */
>>>        (*ckpt_node)->ret_time = reten_time;
>>> +    (*ckpt_node)->attributes.retentionDuration = reten_time;
>>> +
>>> +    /* Update the related patricia tree */
>>> +    CPD_CKPT_MAP_INFO *map_info = NULL;
>>> +    cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
>>> +    if (map_info) {
>>> +        map_info->attributes.retentionDuration = reten_time;
>>> +    }
>>>        return rc;
>>>    }
>>
>


