[tickets] [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

Anders Bjornerstedt Mon, 29 Jun 2015 00:35:08 -0700

Even if the AMF re-reads its entire AMF model from IMM during the PBE absence 
it will see the state
Of the AMF model without that CCB being committed. IMM-ram only commits the CCB 
*after* the PBE has
returned and responded on the outcome of the CCB.

The good news is that the IMM is behaving entirely transactionally with CCBs.
The only bad news is that the AMF currently does not  wish to follow the 
transactional model (relative to imm data) during failover.

/AndersBj

From: Anders Bjornerstedt [mailto:[email protected]]
Sent: den 29 juni 2015 09:22
To: [opensaf:tickets]
Subject: [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on 
becoming applier - both failover and switchover

It is impossible for the IMM to do anything about a critical CCB until the PBE 
re-attaches and the PBE can only re-attach
when the file system is available. So not the IMM can not force the issue here. 
It can neither abort nor commit the CCB
since this would have a 50/50 chance of diverging from the PBE representation 
of the CCB. A cluster restart may very
well happen before the file system comes back, in fact a cluster restart may be 
caused by the long absence of the
file system. It is in fact what happens here.

As far as the AMF is concerned, since nether old active or old standby new 
active can have received any apply
or abort callback in this case, the AMF should act as if it is still waiting 
for the commit of the CCB, i.e. as if the
before-image for the CCB is what is valid and it is.

/AndersBj

From: Nagendra Kumar [mailto:[email protected]]
Sent: den 29 juni 2015 09:02
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on 
becoming applier - both failover and switchover

Hi Anders,
Thanks for your comments. But imm need to take time-bound action in that case. 
It can't wait for a response for long.

Thanks
-Nagu

________________________________

[tickets:#1105]<http://sourceforge.net/p/opensaf/tickets/1105>http://sourceforge.net/p/opensaf/tickets/1105
 AMFD: New standby crashes if blocked on becoming applier - both failover and 
switchover

Status: review
Milestone: 4.5.2
Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
Last Updated: Fri Jun 26, 2015 09:34 AM UTC
Owner: Nagendra Kumar

This ticket is in essence a continuation of ticket #1078

http://sourceforge.net/p/opensaf/tickets/1078/<http://sourceforge.net/p/opensaf/tickets/1078>http://sourceforge.net/p/opensaf/tickets/1078

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds os so), but finally gives up and AMFD 
standby
restarts.

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be acheived by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up in retrying.

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' becaue of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB removing the blockage for the new active AMFD.

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB. Forgets about it and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the ccb and only send it to the system
when the operator does his/her high level apply, we can not rely on that.

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking the saImmOmCcbApply. Then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \ 
safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
inside immcfg itself and aborting the CCB before the scenario can complete.

Quickly after invoking the above I order an si-swap from another shell/node:

immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
attach as long as there is an active, non-empty ccb, that contains operations
on AMF objects.

The first level of solution in my opinion is that both AMFDs should retry
forever (in a separate thread assumed to be the case already) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whomever is listening that thre is a lingering
AMF related CCB that should be terminated (aborted or committed by the user).

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhacnements.

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE.
The probability is low that there is both a critical CCB stuck and that it
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here.
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
can not process configuration changes on its data while in this state.
That is not a fatal condition and so should not be esclated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways.
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync clearing al non critical CCBs.

________________________________

Sent from sourceforge.net because you indicated interest in 
https://sourceforge.net/p/opensaf/tickets/1105/<https://sourceforge.net/p/opensaf/tickets/1105>https://sourceforge.net/p/opensaf/tickets/1105

To unsubscribe from further messages, please visit 
https://sourceforge.net/auth/subscriptions/<https://sourceforge.net/auth/subscriptions>https://sourceforge.net/auth/subscriptions

________________________________

[tickets:#1105]<http://sourceforge.net/p/opensaf/tickets/1105> AMFD: New 
standby crashes if blocked on becoming applier - both failover and switchover

Status: review
Milestone: 4.5.2
Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
Last Updated: Mon Jun 29, 2015 07:02 AM UTC
Owner: Nagendra Kumar

This ticket is in essence a continuation of ticket #1078

http://sourceforge.net/p/opensaf/tickets/1078/<http://sourceforge.net/p/opensaf/tickets/1078>

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds os so), but finally gives up and AMFD 
standby
restarts.

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be acheived by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up in retrying.

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' becaue of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB removing the blockage for the new active AMFD.

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB. Forgets about it and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the ccb and only send it to the system
when the operator does his/her high level apply, we can not rely on that.

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking the saImmOmCcbApply. Then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
inside immcfg itself and aborting the CCB before the scenario can complete.

Quickly after invoking the above I order an si-swap from another shell/node:

immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
attach as long as there is an active, non-empty ccb, that contains operations
on AMF objects.

The first level of solution in my opinion is that both AMFDs should retry
forever (in a separate thread assumed to be the case already) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whomever is listening that thre is a lingering
AMF related CCB that should be terminated (aborted or committed by the user).

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhacnements.

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE.
The probability is low that there is both a critical CCB stuck and that it
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here.
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
can not process configuration changes on its data while in this state.
That is not a fatal condition and so should not be esclated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways.
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync clearing al non critical CCBs.

________________________________

Sent from sourceforge.net because you indicated interest in 
https://sourceforge.net/p/opensaf/tickets/1105/<https://sourceforge.net/p/opensaf/tickets/1105>

To unsubscribe from further messages, please visit 
https://sourceforge.net/auth/subscriptions/<https://sourceforge.net/auth/subscriptions>

---

** [tickets:#1105] AMFD: New standby crashes if blocked on becoming applier - 
both failover and switchover**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
**Last Updated:** Mon Jun 29, 2015 07:02 AM UTC
**Owner:** Nagendra Kumar

This ticket is in essence a continuation of ticket #1078

  http://sourceforge.net/p/opensaf/tickets/1078/

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds os so), but finally gives up and AMFD 
standby
restarts. 

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the 
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be acheived by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up in retrying. 

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' becaue of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB removing the blockage for the new active AMFD. 

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB. Forgets about it and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the ccb and only send it to the system
when the operator does his/her high level apply, we can not rely on that. 

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking the saImmOmCcbApply. Then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \ 
  safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out 
inside immcfg itself and aborting the CCB before the scenario can complete.

Quickly after invoking the above I order an si-swap from another shell/node:

  immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can 
attach as long as there is an active, non-empty ccb, that contains operations
on AMF objects.

The first level of solution in my opinion is that both AMFDs should retry
forever (in a separate thread assumed to be the case already) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whomever is listening that thre is a lingering
AMF related CCB that should be terminated (aborted or committed by the user). 

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhacnements.

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE. 
The probability is low that there is both a critical CCB stuck and that it 
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here. 
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
can not process configuration changes on its data while in this state.
That is not a fatal condition and so should not be esclated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways. 
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync clearing al non critical CCBs.

---

Sent from sourceforge.net because [email protected] is 
subscribed to http://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
http://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.

------------------------------------------------------------------------------
Monitor 25 network devices or servers for free with OpManager!
OpManager is web-based network management software that monitors 
network devices and physical & virtual servers, alerts via email & sms 
for fault. Monitor 25 devices for free with no restriction. Download now
http://ad.doubleclick.net/ddm/clk/292181274;119417398;o

_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

[tickets] [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

Reply via email to