[tickets] [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier

Anders Bjornerstedt Wed, 17 Jun 2015 04:29:23 -0700

Hi

Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
Fix (2) is only a partial fix for #1105: it covers the si-swap case.
I am not sure about the failover case.


Ticket #1108 is also an enhancement that will speed up the progress of any
si-swap or failover that has problems setting OI (or applier).
I see #1108 as still a valid enhancement even after we have this proposed
fix for #1105.
The fix proposed in #1108 is also trivial to implement: just send the admin-op
request asynchronously. There is no need to wait for a response.

/AndersBj

From: Nagendra Kumar [mailto:[email protected]]
Sent: den 17 juni 2015 12:46
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on 
becoming applier


Here is what I will go along with:
1. AMF will return TRY_AGAIN for the SI-SWAP admin op when any CCB is in
progress, and AMF will also set the "error string" appropriately.
2. AMF will return TRY_AGAIN in response to a CCB when an SI-SWAP is in
progress, and AMF will also set the "error string".
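
The two rules above amount to mutual exclusion between CCBs and SI-SWAP. A
minimal Python sketch of that idea follows. It is only a model: the class
AmfGate, its method names, and the error-string texts are hypothetical and are
not taken from the AMF code. Only the return-code names SA_AIS_OK and
SA_AIS_ERR_TRY_AGAIN come from the SAF AIS specification.

```python
from enum import Enum


class Rc(Enum):
    OK = "SA_AIS_OK"
    TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"


class AmfGate:
    """Hypothetical model of the two TRY_AGAIN rules (not real AMF code)."""

    def __init__(self):
        self.active_ccbs = set()          # CCBs touching AMF objects
        self.si_swap_in_progress = False

    def si_swap_start(self):
        # Rule 1: reject the SI-SWAP admin op while any CCB is in progress.
        if self.active_ccbs:
            return Rc.TRY_AGAIN, "CCB in progress, retry SI-SWAP later"
        self.si_swap_in_progress = True
        return Rc.OK, ""

    def si_swap_done(self):
        self.si_swap_in_progress = False

    def ccb_operation(self, ccb_id):
        # Rule 2: reject CCB operations while an SI-SWAP is in progress.
        if self.si_swap_in_progress:
            return Rc.TRY_AGAIN, "SI-SWAP in progress, retry the CCB later"
        self.active_ccbs.add(ccb_id)
        return Rc.OK, ""

    def ccb_finished(self, ccb_id):
        # Called when the CCB is applied or aborted.
        self.active_ccbs.discard(ccb_id)
```

With this model, an SI-SWAP requested while a CCB lingers is refused with
TRY_AGAIN and an error string instead of leaving the standby blocked on
becoming applier.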

Also #1108 and #1111 will be closed.

Thanks,
-Nagu

________________________________

[tickets:#1105]<http://sourceforge.net/p/opensaf/tickets/1105> AMFD: New 
standby crashes if blocked on becoming applier

Status: accepted
Milestone: 4.5.2
Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
Last Updated: Wed Jun 17, 2015 09:48 AM UTC
Owner: Nagendra Kumar

This ticket is in essence a continuation of ticket #1078:

http://sourceforge.net/p/opensaf/tickets/1078/

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds or so), but finally gives up and the
standby AMFD restarts.

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be achieved by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up retrying.

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' because of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB, removing the blockage for the new active AMFD.

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB, forgets about it, and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the CCB and only send it to the system
when the operator does his/her high-level apply, we cannot rely on that.

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking saImmOmCcbApply. I then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
inside immcfg itself and aborting the CCB before the scenario can complete.

Shortly after invoking the above, I ordered an si-swap from another shell/node:

immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
attach as long as there is an active, non-empty CCB that contains operations
on AMF objects.

The first level of solution, in my opinion, is that both AMFDs should retry
forever (in a separate thread, which I assume is already the case) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator, or whoever is listening, that there is a lingering
AMF-related CCB that should be terminated (aborted or committed by the user).
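
As an illustration of "retry forever with periodic notifications", here is a
minimal Python sketch. It is not AMFD code: attach_with_retry, try_attach and
notify are hypothetical names, the intervals are arbitrary, and the sleep and
clock parameters exist only so the loop can be exercised without real waiting.
In AMFD this loop would run in the separate thread mentioned above.

```python
import time


def attach_with_retry(try_attach, notify, retry_interval=1.0,
                      notify_every=30.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Call try_attach() until it succeeds; never give up.

    try_attach() returns True on success and False on TRY_AGAIN.
    notify(msg) is called at most once per notify_every seconds while
    the attach is still blocked. Returns the number of attempts made.
    """
    start = clock()
    last_notice = start
    attempts = 0
    while True:
        attempts += 1
        if try_attach():
            return attempts
        now = clock()
        if now - last_notice >= notify_every:
            notify("still blocked after %.0f s: a lingering AMF-related "
                   "CCB should be aborted or committed" % (now - start))
            last_notice = now
        sleep(retry_interval)
```

The point of the sketch is that a blocked attach produces operator-visible
notifications rather than a restart of the standby.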

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop, preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhancements.

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE.
The probability is low that there is both a critical CCB stuck and that it
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here.
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
cannot process configuration changes on its data while in this state.
That is not a fatal condition and so should not be escalated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways.
A short-term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync, clearing all non-critical CCBs.

________________________________




