Hi Mahesh,
Would you agree with the comments below?
To summarize, the following are the comments so far:
Comment 1: This functionality should be guarded by a check of whether the
Hydra configuration is enabled in IMM (attrName =
const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
Action: The code will be updated accordingly.
Comment 2: Keep the scope of the CPSV service such that non-collocated
checkpoint creation returns NOT_SUPPORTED if the cluster is running with
IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is set at cluster
startup and is currently not configurable, so there is no chance of a
run-time configuration change).
Action: No change in code. CPSV will keep supporting non-collocated
checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
Comment 3: This is about the case where the checkpoint node director (cpnd)
crashes during the headless state. In this case the cpnd cannot finish
starting because it cannot initialize the CLM service.
After a timeout, AMF triggers a restart again, and finally the node is
rebooted.
It is expected that this problem should not lead to a node reboot.
Action: No change in code. This is a limitation of the system during the
headless state.
If you agree with the summary above, I'll update the code and send out V3
for review.
Best regards,
Nhat Pham
From: Anders Widell [mailto:[email protected]]
Sent: Wednesday, February 24, 2016 9:26 PM
To: Nhat Pham <[email protected]>; 'A V Mahesh'
<[email protected]>
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>; 'Minh Chau H' <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
See my comments inline, marked [AndersW3].
regards,
Anders Widell
On 02/24/2016 07:32 AM, Nhat Pham wrote:
Hi Mahesh and Anders,
Please see my comments below.
Best regards,
Nhat Pham
From: A V Mahesh [mailto:[email protected]]
Sent: Wednesday, February 24, 2016 11:06 AM
To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>; 'Minh Chau H' <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
Hi Nhat Pham,
If a component (CPND) restart is allowed while the controllers are absent,
and we change the return value to SA_AIS_ERR_TRY_AGAIN before requesting
CLM, we need to get clarification from the AMF team on a few things. If
CPND keeps getting SA_AIS_ERR_TRY_AGAIN and the component restart times
out, AMF will restart the component again (this becomes cyclic), and after
the configured saAmfSGCompRestartMax value the node goes for a reboot as
the next level of escalation.
In that case we may also require changes in AMF so that it does not act on
the component restart timeout while the controllers are absent (I am not
sure whether that would deviate from the AMF specification).
[Nhat Pham] In the headless state, I'm not sure about this either.
@Anders: Would you have comments on this?
[AndersW3] Ok, first of all I would like to point out that normally, the
OpenSAF checkpoint node director should not crash. So we are talking about a
situation where multiple faults have occurred: first both the active and the
standby system controllers have died, and then shortly afterwards - before
we have a new active system controller - the checkpoint node director also
crashes. Sure, these may not be totally independent events, but still there
are a lot of faults that have happened within a short period of time. We
should test the node director and make sure it doesn't crash in this type of
scenario.
Now, let's consider the case where we have a fault in the node director that
causes it to crash during the headless state. The general philosophy of the
headless feature is that when things work fine - i.e. in the absence of
fault - we should be able to continue running while the system controllers
are absent. However, if a fault happens during the headless state, we may
not be able to recover from the fault until there is an active system
controller. AMF does provide support for restarting components, but as you
have pointed out, the node director will be stuck in a TRY_AGAIN loop
immediately after it has been restarted. So this means that if the node
director crashes during the headless state, we have lost the checkpoint
functionality on that node and we will not get it back until there is an
active system controller. Other services like IMM will still work for a
while, but AMF will as you say eventually escalate the checkpoint node
director failure to a node restart and then the whole node is gone. The node
will not come back until we have an active system controller. So to
summarize: there is very limited support for recovering from faults that
happen during the headless state. The full recovery will not happen until we
have an active system controller.
Please incorporate the current comments (from a design perspective) and
republish the patch. I will re-test the V3 patch and provide review
comments on functional issues/bugs if I find any.
One important note: in the new patch, let us not have the complexity of
allowing non-collocated checkpoint creation and then documenting that the
replicas are recoverable only in some scenarios. A replica is USER private
data (not OpenSAF state), and losing USER private data is not acceptable.
So let us keep the scope of the CPSV service such that non-collocated
checkpoint creation is NOT_SUPPORTED if the cluster is running with
IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is set at the
time of cluster startup and is currently not configurable, so there is no
chance of a run-time configuration change).
We can provide support for non-collocated checkpoints in subsequent
enhancements with a solution such as also creating a replica on the PL with
the lowest node ID (a maximum of three replicas in the cluster, regardless
of where the non-collocated checkpoint is opened).
So for now, regardless of whether the heads (SCs) exist, CPSV should return
SA_AIS_ERR_NOT_SUPPORTED in a cluster where IMMSV_SC_ABSENCE_ALLOWED is
enabled,
and let us document it as well.
[Nhat Pham] The patch is meant to limit losing replicas and checkpoints in
the headless state.
If both replicas are located on the SCs and they reboot, losing the
checkpoint is unpreventable with the current design after the headless state.
Even if we implement the proposal "max three replicas in the cluster
regardless of where the non-collocated checkpoint is opened", there are
still cases where the checkpoint is lost, e.g. when the SCs and the PL
which hosts the replica reboot at the same time.
With IMMSV_SC_ABSENCE_ALLOWED disabled, if both SCs reboot, the whole
cluster reboots and the checkpoint is lost.
What I mean is that there are cases where the checkpoint is lost. The point
is what we can do to limit losing data.
As for the proposal to reject creating non-collocated checkpoints when
IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to a
compatibility problem.
@Anders: What do you think about rejecting the creation of non-collocated
checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?
[AndersW3] No, I think we ought to support non-colocated checkpoints also
when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system
controllers" is an implementation detail of OpenSAF. I don't think the CKPT
SAF specification implies that non-colocated checkpoints must be fully
replicated on all the nodes in the cluster, and thus we must accept the
possibility that all replicas are lost. It is not clear exactly what to
expect from the APIs when this happens, but you could handle it in a similar
way as the case when all sections have been automatically deleted by the
checkpoint service because the sections have expired.
-AVM
On 2/24/2016 6:51 AM, Nhat Pham wrote:
Hi Mahesh,
Do you have any further comments?
Best regards,
Nhat Pham
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 22, 2016 10:37 AM
To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>; 'Minh Chau H' <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
Hi,
>>BTW, have you finished the review and test?
I will finish by today.
-AVM
On 2/22/2016 7:48 AM, Nhat Pham wrote:
Hi Mahesh and Anders,
Please see my comment below.
BTW, have you finished the review and test?
Best regards,
Nhat Pham
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, February 19, 2016 2:28 PM
To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>;
'Minh Chau H' <[email protected]>
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
Hi Nhat Pham,
On 2/19/2016 12:28 PM, Nhat Pham wrote:
Could you please give more detailed information about steps to reproduce the
problem below? Thanks.
Don't see this as a specific bug; we need to look at the issue from the
point of view of a CLM-integrated service.
Considering Anders Widell's explanation of CLM application behavior during
the headless state,
we need to reintegrate CPND with CLM (before this headless-state feature
there was no case of CPND existing in the absence of CLMD, but now there is).
And this will be consistent across all the services integrated with CLM
(you may need some changes in CLM as well).
[Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.
@Anders: What do you think?
To start with, let us consider the case where CPND is restarted on a
payload (PL) during the headless state while an application is running on
that PL.
[Nhat Pham] Regarding CPND as a CLM application, I'm not sure what it can
do in this case. When it restarts, it is monitored by AMF.
If it blocks for too long, AMF will also trigger a node reboot.
In my test case, the CPND gets blocked by CLM; it never returns from
saClmInitialize. How did you get the "ER cpnd clm init failed with return
value:31"?
Following is the cpnd trace:
Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup:
clma_use_count: 0
Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
-AVM
On 2/19/2016 12:28 PM, Nhat Pham wrote:
Hi Mahesh,
Could you please give more detailed information about steps to reproduce the
problem below? Thanks.
Best regards,
Nhat Pham
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, February 19, 2016 1:06 PM
To: Anders Widell <[email protected]>; Nhat Pham <[email protected]>;
'Minh Chau H' <[email protected]>
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
Hi Anders Widell,
Thanks for the detailed explanation about CLM during headless state.
Hi Nhat Pham,
Comment 3:
Please see below the problem I described earlier; I am now seeing it in the
absence of CLMD (during the headless state).
So CPND/CLMA now need to address the case below: currently the cpnd CLM
init fails with return value SA_AIS_ERR_UNAVAILABLE,
but it should be SA_AIS_ERR_TRY_AGAIN.
==================================================
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42
(MsgQueueService132111) <108, 2040f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43
(MsgQueueService131855) <0, 2030f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44
(safLogService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE:
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45
(safClmService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return
value:31
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
Feb 19 11:18:28 PL-4 osafclmna[5432]: NO
safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to
NCSMDS_NEW_ACTIVE
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46
(safAmfService) <0, 2010f>
Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO
'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation
timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of
'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration
timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA
'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State
RESTARTING => INSTANTIATION_FAILED
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for
'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component:
'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER
'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE
Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111,
SupervisionTime = 60
Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
==================================================
-AVM
On 2/15/2016 5:11 PM, Anders Widell wrote:
Hi!
Please find my answer inline, marked [AndersW].
regards,
Anders Widell
On 02/15/2016 10:38 AM, Nhat Pham wrote:
Hi Mahesh,
It's good. Thank you. :)
[AVM] Upon rejoining of the SCs, the replica should be re-created
regardless of whether another application opens it on PL4.
(Note: this comment is based on your explanation; I have not yet
reviewed/tested. Currently I am struggling with the SCs not rejoining
after the headless state; I can provide more on this once I complete my
review/testing.)
[Nhat] To make cloud resilience work, you need the patches from the other
services (log, amf, clm, ntf).
@Minh: I heard that you created a tar file which includes all the patches.
Could you please send it to Mahesh? Thanks.
[AVM] I understand that. Before I comment more on this, please allow me to
understand: I am still not very clear on the headless design in detail.
For example, what about the cluster membership of PLs during the headless
state?
In the absence of the SCs (CLMD), are the PLs considered cluster nodes or
not (cluster membership)?
[Nhat] I don't know much about this.
@Anders: Could you please comment on this? Thanks.
[AndersW] First of all, keep in mind that the "headless" state should
ideally not last a very long time. Once we have the spare SC feature in
place (ticket [#79]), a new SC should become active within a matter of a few
seconds after we have lost both the active and the standby SC.
I think you should view the state of the cluster in the headless state in
the same way as you view the state of the cluster during a failover between
the active and the standby SC. Imagine that the active SC dies. It takes the
standby SC 1.5 seconds to detect the failure of the active SC (this is due
to the TIPC timeout). If you have configured the PROMOTE_ACTIVE_TIMER, there
is an additional delay before the standby takes over as active. What is the
state of the cluster during the time after the active SC failed and before
the standby takes over?
The state of the cluster while it is headless is very similar. The
difference is that this state may last a little bit longer (though not more
than a few seconds, until one of the spare SCs becomes active). Another
difference is that we may have lost some state. With a "perfect"
implementation of the headless feature we should not lose any state at all,
but with the current set of patches we do lose state.
So specifically if we talk about cluster membership and ask the question: is
a particular PL a member of the cluster or not during the headless state?
Well, if you ask CLM about this during the headless state, then you will not
know - because CLM doesn't provide any service during the headless state. If
you keep retrying your query to CLM, you will eventually get an answer - but
you will not get this answer until there is an active SC again and we have
exited the headless state. When viewed in this way, the answer to the
question about a node's membership is undefined during the headless state,
since CLM will not provide you with any answer until there is an active SC.
However, if you asked CLM about the node's cluster membership status before
the cluster went headless, you probably saved a cached copy of the cluster
membership state. Maybe you also installed a CLM track callback and intend
to update this cached copy every time the cluster membership status changes.
The question then is: can you continue using this cached copy of the cluster
membership state during the headless state? The answer is YES: since CLM
doesn't provide any service during the headless state, it also means that
the cluster membership view cannot change during this time. Nodes can of
course reboot or die, but CLM will not notice and hence the cluster view
will not be updated. You can argue that this is bad because the cluster view
doesn't reflect reality, but notice that this will always be the case. We
can never propagate information instantaneously, and detection of node
failures will take 1.5 seconds due to the TIPC timeout. You can never be
sure that a node is alive at this very moment just because CLM tells you
that it is a member of the cluster. If we are unfortunate enough to lose
both system controller nodes simultaneously, updates to the cluster
membership view will be delayed a few seconds longer than usual.
Best regards,
Nhat Pham
-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 15, 2016 11:19 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]; 'Beatriz Brandao'
<[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]
Hi Nhat Pham,
How did your holiday go?
Please find my comments below.
On 2/15/2016 8:43 AM, Nhat Pham wrote:
Hi Mahesh,
For comment 1, the patch will be updated accordingly.
[AVM] Please hold; I will provide more comments this week, so we can have a
consolidated V3.
For comment 2, I think the CKPT service will not be backward compatible
if scAbsenceAllowed is true:
the client can't create non-collocated checkpoints on the SCs.
Furthermore, this solution only protects the CKPT service in the case
"the non-collocated checkpoint is created on an SC";
there are still cases where the replicas are completely lost. Ex:
- The non-collocated checkpoint is created on a PL. The PL reboots, so both
replicas are now located on the SCs. Then the headless state happens and
all replicas are lost.
- The non-collocated checkpoint has its active replica on a PL and this PL
restarts during the headless state.
- The non-collocated checkpoint is created on PL3. This checkpoint is also
opened on PL4. Then the SCs and PL3 reboot.
[AVM] Upon rejoining of the SCs, the replica should be re-created
regardless of whether another application opens it on PL4.
(Note: this comment is based on your explanation; I have not yet
reviewed/tested. Currently I am struggling with the SCs not rejoining
after the headless state; I can provide more on this once I complete my
review/testing.)
In this case, all replicas are lost and the client has to create the
checkpoint again.
When multiple nodes (including the SCs) reboot, losing replicas is
unpreventable. The patch recovers the checkpoints in the cases where it is
possible. What do you think?
[AVM] I understand that. Before I comment more on this, please allow me to
understand: I am still not very clear on the headless design in detail.
For example, what about the cluster membership of PLs during the headless
state?
In the absence of the SCs (CLMD), are the PLs considered cluster nodes or
not (cluster membership)?
- If they are NOT considered cluster nodes, the Checkpoint Service API
should leverage the SA Forum Cluster Membership Service, and the APIs can
fail with SA_AIS_ERR_UNAVAILABLE.
- If they are considered cluster nodes, we need to follow all the rules
defined in the SAI-AIS-CKPT-B.02.02 specification.
So give me some more time to review it completely, so that we can have a
consolidated V3 patch.
-AVM
Best regards,
Nhat Pham
-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, February 12, 2016 11:10 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]; Beatriz Brandao
<[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
preserving and recovering checkpoint replicas during headless state V2
[#1621]
Comment 2:
After incorporating comment 1, all the limitations should be prevented
based on whether the Hydra configuration is enabled in IMM.
For example: if an application tries to create a non-collocated checkpoint
whose active replica would be generated/located on an SC, then, regardless
of whether the heads (SCs) exist, it should return SA_AIS_ERR_NOT_SUPPORTED.
In other words, rather than allowing a non-collocated checkpoint to be
created while the heads (SCs) exist, and then having it become
unrecoverable after the heads (SCs) rejoin:
=============================================================================
Limitation: The CKPT service doesn't support recovering checkpoints in the
following cases:
. The checkpoint was unlinked before the headless state.
. The non-collocated checkpoint has its active replica located on an SC.
. The non-collocated checkpoint has its active replica located on a PL and
this PL restarts during the headless state. In these cases, the checkpoint
replica is destroyed, the fault code SA_AIS_ERR_BAD_HANDLE is returned when
the client accesses the checkpoint, and the client must re-open the
checkpoint.
=============================================================================
-AVM
On 2/11/2016 12:52 PM, A V Mahesh wrote:
Hi,
I just started reviewing the patch; I will give comments as soon as I come
across any, to save some time.
Comment 1:
This functionality should be guarded by a check of whether the Hydra
configuration is enabled in IMM (attrName =
const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
Please see how the LOG/AMF services implemented it as an example.
-AVM
On 1/29/2016 1:02 PM, Nhat Pham wrote:
Hi Mahesh,
As described in the README, the CKPT service returns the
SA_AIS_ERR_TRY_AGAIN fault code in this case.
I guess it's the same for the other services.
@Anders: Could you please confirm this?
Best regards,
Nhat Pham
-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, January 29, 2016 2:11 PM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
preserving and recovering checkpoint replicas during headless state
V2 [#1621]
Hi,
On 1/29/2016 11:45 AM, Nhat Pham wrote:
- The behavior of the application will be consistent with other
SAF services' (e.g. IMM/AMF) behavior during the headless state.
[Nhat] I'm not clear what you mean by "consistent"?
In the absence of the directors (SCs), what are the expected return values
of the SAF APIs (for all services) that are not in a position to provide
service at that moment?
I think all services should return the same SAF errors. I don't think we
currently have that; maybe Anders Widell will help us.
-AVM
On 1/29/2016 11:45 AM, Nhat Pham wrote:
Hi Mahesh,
Please see the attachment for the README. Let me know if any more
information is required.
Regarding your comments:
- during the headless state, applications may behave like during the
CPND restart case
[Nhat] Headless state and CPND restart are different events, so the
behavior is different. The headless state is the case where both SCs go down.
- The behavior of the application will be consistent with other
SAF services' (e.g. IMM/AMF) behavior during the headless state.
[Nhat] I'm not clear what you mean by "consistent"?
Best regards,
Nhat Pham
-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, January 29, 2016 11:12 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
preserving and recovering checkpoint replicas during headless state
V2 [#1621]
Hi Nhat Pham,
I started reviewing this patch. Could you please provide a README file
with the scope and limitations? That will help define the
testing/reviewing scope.
Following are the minimum things we can keep in mind while
reviewing/accepting the patch:
- Not affecting existing functionality
- During the headless state, applications may behave like during the
CPND restart case
- The minimum functionality of the application works
- The behavior of the application will be consistent with
other SAF services' (e.g. IMM/AMF) behavior during the headless state
So please provide additional details in the README if any of the above is
deviated from, to let users know about the limitations/deviations.
-AVM
On 1/4/2016 3:15 PM, Nhat Pham wrote:
Summary: cpsv: Support preserving and recovering checkpoint
replicas during headless state [#1621] Review request for Trac
Ticket(s):
#1621 Peer Reviewer(s): [email protected]
<mailto:[email protected]> ;
[email protected] <mailto:[email protected]> Pull request
to:
[email protected] <mailto:[email protected]> Affected
branch(es): default Development
branch: default
--------------------------------
Impacted area Impact y/n
--------------------------------
Docs n
Build system n
RPM/packaging n
Configuration files n
Startup scripts n
SAF services y
OpenSAF services n
Core libraries n
Samples n
Tests n
Other n
Comments (indicate scope for each "y" above):
---------------------------------------------
changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
Author: Nhat Pham <[email protected]>
Date: Mon, 04 Jan 2016 16:34:33 +0700
cpsv: Support preserving and recovering checkpoint replicas
during headless state [#1621]
Background:
----------
This enhancement supports preserving checkpoint replicas when both SCs are
down (headless state) and recovering the replicas when one of the SCs comes
up again. If both SCs go down, the checkpoint replicas on surviving nodes
still remain. When an SC is available again, the surviving replicas are
automatically registered in the SC checkpoint database. The content of the
surviving replicas stays intact and is synchronized to new replicas.
When no SC is available, client API calls that change the checkpoint
configuration, which requires SC communication, are rejected. Client API
calls reading and writing existing checkpoint replicas still work.
Limitation: The CKPT service does not support recovering checkpoints in the
following cases:
- The checkpoint was unlinked before the headless state.
- The non-collocated checkpoint has its active replica located on an SC.
- The non-collocated checkpoint has its active replica located on a PL and
this PL restarts during the headless state. In these cases, the checkpoint
replica is destroyed, the fault code SA_AIS_ERR_BAD_HANDLE is returned when
the client accesses the checkpoint, and the client must re-open the
checkpoint.
While in the headless state, accessing checkpoint replicas does not work if
the node which hosts the active replica goes down. It will be back working
when an SC is available again.
Solution:
---------
The solution for this enhancement includes two parts:
1. Destroying the un-recoverable checkpoints described above when both SCs
are down: when both SCs are down, the CPND deletes the un-recoverable
checkpoint nodes and replicas on the PLs. It then requests CPA to destroy
the corresponding checkpoint node, using the new message
CPA_EVT_ND2A_CKPT_DESTROY.
2. Updating CPD with checkpoint information: when an active SC comes up
after the headless state, CPND updates CPD with the checkpoint information
using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
CPD_EVT_ND2D_CKPT_CREATE. This is because CPND would create a new ckpt_id
for the checkpoint, which might differ from the current ckpt id, if
CPD_EVT_ND2D_CKPT_CREATE were used. The CPD collects checkpoint information
within 6 s. During this update window, the following requests are rejected
with the fault code SA_AIS_ERR_TRY_AGAIN:
- CPD_EVT_ND2D_CKPT_CREATE
- CPD_EVT_ND2D_CKPT_UNLINK
- CPD_EVT_ND2D_ACTIVE_SET
- CPD_EVT_ND2D_CKPT_RDSET
Complete diffstat:
------------------
 osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 ++++++
 osaf/libs/common/cpsv/cpsv_edu.c          |  43 +++++
 osaf/libs/common/cpsv/include/cpd_cb.h    |   3 +
 osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
 osaf/libs/common/cpsv/include/cpd_proc.h  |   7 +
 osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
 osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
 osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
 osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 ++++
 osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
 osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 ++++++++++++++++++++++
 osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 +++++++++++
 osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 +++-
 osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 ++++++++++++++++++++++++++
 osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 +
 osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 ++
 osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 ++++
 osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 ++-
 osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 ++
 osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 +++++++++++++++++++++---
 20 files changed, 1189 insertions(+), 11 deletions(-)
Testing Commands:
-----------------
-
Testing, Expected Results:
--------------------------
-
Conditions of Submission:
-------------------------
<<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
Arch Built Started Linux distro
-------------------------------------------
mips n n
mips64 n n
x86 n n
x86_64 n n
powerpc n n
powerpc64 n n
Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any
checkmarks!]
Your checkin has not passed review because (see checked entries):
___ Your RR template is generally incomplete; it has too many
blank
entries
that need proper data filled in.
___ You have failed to nominate the proper persons for review and
push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your
headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your
commits.
___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build
tests.
Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be
removed.
___ You have needlessly changed whitespace or added whitespace crimes
like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
cosmetic code cleanup changes. These have to be separate
commits.
___ You need to refactor your submission into logical chunks; there is
too much content into a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent;
Instead you should place your content in a public tree to
be pulled.
___ You have too many commits attached to an e-mail; resend as
threaded
commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear
indication
of what has changed between each re-send.
___ You have failed to adequately and individually address all of the
comments and change requests that were proposed in the
initial
review.
___ You have a misconfigured ~/.hgrc file (i.e. username, email
etc)
___ Your computer has a badly configured date and time, confusing the
threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any
results
for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch
series
do not contain the patch that updates the Doxygen manual.
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel