[tickets] [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

Anders Bjornerstedt Sun, 28 Jun 2015 23:49:30 -0700

For critical CCBs the wait can be indefinite since the delay can be due to 
problems on the file system.


The AMF should not block a failover just because it cannot attach as OI.
The AMF failover mechanism has no inherent functional dependence on
the AMF OI being available.
Any such dependency is unnecessary and an impediment to service availability.

/AndersBj


From: Nagendra Kumar [mailto:[email protected]]
Sent: den 22 juni 2015 08:54
To: [email protected]
Subject: [tickets] [opensaf:tickets] #1105 AMFD: New standby crashes if blocked 
on becoming applier - both failover and switchover


For non-critical CCBs, ticket #1391 will take care of it.
For critical CCBs, AMF should be OK waiting a little while when PBE delays the response.

So I will go ahead and implement the two points mentioned above as
part of #1105, and the others will get closed.

Thanks
-Nagu

________________________________

[tickets:#1105]<http://sourceforge.net/p/opensaf/tickets/1105> AMFD: New 
standby crashes if blocked on becoming applier - both failover and switchover

Status: accepted
Milestone: 4.5.2
Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
Last Updated: Wed Jun 17, 2015 12:59 PM UTC
Owner: Nagendra Kumar

This ticket is in essence a continuation of ticket #1078

http://sourceforge.net/p/opensaf/tickets/1078/

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds or so), but finally gives up and the AMFD
standby restarts.

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be achieved by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up retrying.

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' because of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB, removing the blockage for the new active AMFD.

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB. Forgets about it and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the CCB and only send it to the system
when the operator does his/her high level apply, we cannot rely on that.

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking the saImmOmCcbApply. Then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
inside immcfg itself and aborting the CCB before the scenario can complete.

Shortly after invoking the above, I ordered an si-swap from another shell/node:

immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
attach as long as there is an active, non-empty CCB that contains operations
on AMF objects.

The first level of solution in my opinion is that both AMFDs should retry
forever (in a separate thread, which I assume is already the case) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whoever is listening that there is a lingering
AMF related CCB that should be terminated (aborted or committed by the user).
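
The retry-forever behaviour proposed above can be sketched roughly as follows.
This is only an illustration in Python, not AMFD code: try_attach and notify
are hypothetical callables standing in for the implementer/applier set call
and for whatever notification mechanism is used, and the interval values are
assumptions.

```python
import time

OK = "OK"                # stands in for SA_AIS_OK
TRY_AGAIN = "TRY_AGAIN"  # stands in for SA_AIS_ERR_TRY_AGAIN


def attach_with_retry(try_attach, notify, retry_interval=0.5,
                      notify_every=60.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Retry attaching without any upper bound on the number of attempts.

    try_attach   -- hypothetical callable returning OK or TRY_AGAIN
    notify       -- hypothetical callable taking a message string
    notify_every -- seconds between operator notifications (assumed value)

    Returns the number of attempts that were needed.
    """
    attempts = 0
    last_notice = clock()
    while True:
        attempts += 1
        rc = try_attach()
        if rc == OK:
            return attempts
        if rc != TRY_AGAIN:
            # Anything other than TRY_AGAIN is outside this sketch.
            raise RuntimeError("unexpected return code: %r" % (rc,))
        now = clock()
        if now - last_notice >= notify_every:
            notify("lingering AMF related CCB blocks attach; "
                   "abort or commit it")
            last_notice = now
        sleep(retry_interval)
```

The sleep and clock parameters are injectable only so that the loop can be
exercised without real delays. The point is that the loop has no give-up
branch, so a lingering CCB never escalates to a restart.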

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhancements.

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE.
The probability is low that there is both a critical CCB stuck and that it
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here.
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.
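
The handling of non-critical versus critical CCBs described in the last two
paragraphs amounts to a small decision rule. The sketch below is one possible
reading of it in Python. The names Action and next_action and the grace
period value are all hypothetical, and the admin-operation for clearing
non-critical CCBs does not exist yet.

```python
from enum import Enum


class Action(Enum):
    KEEP_RETRYING = "keep retrying the implementer/applier set"
    REQUEST_CCB_CLEAR = "ask immsv to clear non-critical CCBs"
    WAIT_FOR_PBE = "wait; only emergency PBE disablement can unblock"


def next_action(ccb_is_critical, is_active, waited_s, grace_s=30.0):
    """Pick what an AMFD stuck in the TRY_AGAIN loop should do next.

    ccb_is_critical -- True if immsv is waiting on the PBE commit result
    is_active       -- True for the active AMFD, False for the standby
    waited_s        -- seconds already spent in the TRY_AGAIN loop
    grace_s         -- assumed waiting period before asking for a clear
    """
    if ccb_is_critical:
        # immsv cannot clear a critical CCB by itself, so AMF just waits.
        return Action.WAIT_FOR_PBE
    if is_active and waited_s >= grace_s:
        # Only the active AMFD invokes the (proposed) admin-operation.
        return Action.REQUEST_CCB_CLEAR
    return Action.KEEP_RETRYING
```

Note that none of the branches leads to a restart of the SC.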

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
cannot process configuration changes on its data while in this state.
That is not a fatal condition and so should not be escalated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways.
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync, clearing all non-critical CCBs.

________________________________

Sent from sourceforge.net because [email protected] is subscribed to
https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a 
mailing list, you can unsubscribe from the mailing list.



---

** [tickets:#1105] AMFD: New standby crashes if blocked on becoming applier - 
both failover and switchover**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
**Last Updated:** Fri Jun 26, 2015 09:34 AM UTC
**Owner:** Nagendra Kumar




