[tickets] [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier

Anders Bjornerstedt Wed, 17 Jun 2015 05:34:02 -0700


On 06/17/2015 01:28 PM, Anders Bjornerstedt wrote:
>
> Hi
>
> Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
> Fix (2) is only a partial fix for (2) that fixes #1105 for the si-swap 
> case. Not sure about the failover case.
>
I just reproduced problem #1105 for the failover case (not switchover).
Doing so only requires a CCB that lingers, say for 240 seconds before
applying, and that the PBE (IMMND coord) resides at the standby SC before
failover. If the PBE (IMMND coord) resides at the active SC before failover,
it has to re-attach at the standby at failover, and since the PBE invokes the
admin-op for aborting non-critical CCBs when re-attaching, the AMF is in that
case saved by the PBE. But that will be roughly 50% of the failover cases.


If the PBE does not need a restart at failover, because it already resided
at old-standby-new-active, then the AMFD at old-standby-new-active is not
saved by the PBE and will reboot, resulting in CLUSTER RELOAD.
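
The two failover outcomes described above can be summarized as a small decision table. This is only an illustrative Python model of the reasoning in this mail, not OpenSAF code; the names `ScRole` and `failover_outcome` and the outcome strings are made up for the sketch.

```python
from enum import Enum


class ScRole(Enum):
    """Role of the SC hosting the PBE (IMMND coord) before the failover."""
    ACTIVE = "active"
    STANDBY = "standby"


def failover_outcome(pbe_host: ScRole, lingering_amf_ccb: bool) -> str:
    """Model what happens to the new active AMFD at failover (hypothetical)."""
    if not lingering_amf_ccb:
        # No interfering CCB, so the AMFD can set itself as OI right away.
        return "amfd attaches as OI"
    if pbe_host is ScRole.ACTIVE:
        # The PBE has to re-attach on the surviving SC, and on re-attach it
        # invokes the admin-op that aborts non-critical CCBs.
        return "CCB aborted by PBE, amfd attaches as OI"
    # The PBE already ran on the old standby, so nothing aborts the CCB.
    return "amfd gives up, cluster reload"
```

Only the last branch is the problematic one, and it is taken whenever the PBE happened to run on the standby SC before the failover.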

So I claim that to really fix #1105, not just for the si-swap interference
problem but also for the failover case, you really need the fix for #1108.
There are of course alternatives to a fix of type #1108.
But why not take that one when we have it, instead of inventing yet another
way or indefinitely delaying becoming AMF-OI?

The solution of the AMFD invoking an admin-op on the IMM was earlier
"rejected" with the motivation that such a solution was "proprietary".
While "proprietary" is not the correct word for describing a mechanism
that is public and part of an open-source implementation, I guess the
complaint was that the solution was OpenSAF specific. But I don't get what
the problem would be with the internals of OpenSAF being OpenSAF specific.

/AndersBj

> Ticket #1108 is also an enhancement that will speed up the progress of 
> any si-swap or failover that has problems
> setting OI (or applier).
> I see enhancement #1108 as still a valid enhancement even after we 
> have this proposed fix for #1105.
> The fix proposed in #1108 is also trivial to implement. Just send the 
> admin-op request asynchronously.
> No need to wait on a response.
>
> /AndersBj
>
> From: Nagendra Kumar [mailto:[email protected]]
> Sent: den 17 juni 2015 12:46
> To: [opensaf:tickets]
> Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked 
> on becoming applier
>
> Here is what I will go along with:
> 1. AMF will return TRY_AGAIN for the SI-SWAP admin op when any CCB is
> in progress, and AMF will also set the "error string" appropriately.
> 2. AMF will return TRY_AGAIN in response to a CCB when SI-SWAP is in
> progress, and AMF will also set the "error string".
>
> Also #1108 and #1111 will be closed.
>
> Thanks,
> -Nagu
>
> ------------------------------------------------------------------------
>
> [tickets:#1105] http://sourceforge.net/p/opensaf/tickets/1105
>  
> AMFD: New standby crashes if blocked on becoming applier
>
> Status: accepted
> Milestone: 4.5.2
> Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
> Last Updated: Wed Jun 17, 2015 09:48 AM UTC
> Owner: Nagendra Kumar
>
> This ticket is in essence a continuation of ticket #1078
>
> http://sourceforge.net/p/opensaf/tickets/1078/
>
> In switchover, the new standby fails to attach as AMFD applier. It retries
> this for a limited time (45 seconds or so), but finally gives up and
> AMFD standby
> restarts.
>
> In ticket 1078 the blockage was actually caused by a bug because the 
> lingering
> CCB was in that case not interfering with AMF data (data monitored by the
> AMFD-OI and the AMFD-applier). That "false" interference is fixed by 
> the patch
> for #1078.
>
> But this ticket tracks the case of true interference. The very same 
> symptom
> can be achieved by creating a CCB that modifies an AMF object and then
> lingers.
> An si-swap done in this setup will result in the new standby rebooting 
> after
> it gives up in retrying.
>
> The new active AMFD is doing the very same thing, failing to set itself
> as OI 'saAmfService' because of the interfering CCB. But the crashed
> standby
> AMFD triggers the restart of that SC, which triggers a sync, which 
> aborts the
> CCB removing the blockage for the new active AMFD.
>
> Note that this scenario is not totally unrealistic. An operator starts to
> build a CCB. Forgets about it and then performs an si-swap. That will 
> cause
> an SC restart. Not good.
>
> While a good NBI frontend should buffer the ccb and only send it to 
> the system
> when the operator does his/her high level apply, we can not rely on that.
>
> I reproduced this scenario by hacking immcfg so that it waits 60 
> seconds before
> invoking the saImmOmCcbApply. Then invoked this on one node:
>
> immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
> safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN
>
> The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
> inside immcfg itself and aborting the CCB before the scenario can 
> complete.
>
> Quickly after invoking the above I order an si-swap from another 
> shell/node:
>
> immadm -o 7 safSi=SC-2N,safApp=OpenSAF
>
> The basic problem here is that neither the AMFD-OI nor the 
> AMFD-applier can
> attach as long as there is an active, non-empty ccb, that contains 
> operations
> on AMF objects.
>
> The first level of solution in my opinion is that both AMFDs should retry
> forever (in a separate thread assumed to be the case already) to attach as
> implementer/applier. A notification should be sent periodically
> to inform the operator or whoever is listening that there is a lingering
> AMF related CCB that should be terminated (aborted or committed by the 
> user).
>
> Rebooting an SC is a very coarse way of clearing CCBs. The Immsv 
> should provide
> an admin-operation for this purpose. The active AMFD could invoke this 
> admop
> to trigger the immsv to clear all non-critical CCBs. It should do this 
> if it
> ends up in the implementer-set TRY_AGAIN loop. Preferably after it has 
> waited
> for a while. Adding such an admin-operation to the immsv and implementing
> its use in AMF should probably be seen as two enhancements.
>
> The really thorny issue is that there can be blocked critical CCBs.
> These are CCBs where the immsv is waiting on the result of commit from 
> PBE.
> The probability is low that there is both a critical CCB stuck and that it
> contains AMF object operations, but it can happen. Such a system is in 
> ANY CASE
> stuck in its CCB processing so the AMF should wait indefinitely here.
> Currently the system should cluster restart after some time. Not good.
> The immsv can not clear critical CCBs by itself. The only option is to
> use the admin-op (already implemented) for emergency disablement of PBE.
>
> To summarize: This defect ticket is only concerned with the problem of 
> the AMF
> rebooting its standby when this scenario occurs. This should be changed to
> eternal wait with periodic notifications. The AMF service is 
> functioning but
> can not process configuration changes on its data while in this state.
> That is not a fatal condition and so should not be escalated to SC restart.
>
> The problem of how to clear the interfering CCB can be solved in many 
> ways.
> A short term alternative (a hack solution) is for the AMF to reboot a 
> payload.
> That would also trigger a sync clearing all non-critical CCBs.
>
> ------------------------------------------------------------------------
>
> Sent from sourceforge.net because you indicated interest in
> https://sourceforge.net/p/opensaf/tickets/1105/
>
> To unsubscribe from further messages, please visit
> https://sourceforge.net/auth/subscriptions/
>




---

** [tickets:#1105] AMFD: New standby crashes if blocked on becoming applier**

**Status:** accepted
**Milestone:** 4.5.2
**Created:** Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
**Last Updated:** Wed Jun 17, 2015 10:45 AM UTC
**Owner:** Nagendra Kumar

This ticket is in essence a continuation of ticket #1078

  http://sourceforge.net/p/opensaf/tickets/1078/

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds or so), but finally gives up and AMFD
standby
restarts. 

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the 
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be achieved by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up in retrying. 

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' because of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB removing the blockage for the new active AMFD. 

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB. Forgets about it and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the ccb and only send it to the system
when the operator does his/her high level apply, we can not rely on that. 

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking the saImmOmCcbApply. Then invoked this on one node:

  immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
    safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out 
inside immcfg itself and aborting the CCB before the scenario can complete.

Quickly after invoking the above I order an si-swap from another shell/node:

  immadm -o 7 safSi=SC-2N,safApp=OpenSAF


The basic problem here is that neither the AMFD-OI nor the AMFD-applier can 
attach as long as there is an active, non-empty ccb, that contains operations
on AMF objects.
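
The blocking condition stated above can be written as a simple predicate. This is a hypothetical Python model for illustration only: the `Ccb` class, its fields and `blocks_amf_attach` are invented names, and the real check is done inside the immsv, not by code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Ccb:
    """Toy model of a CCB (hypothetical, not an IMM data structure)."""
    active: bool
    # DNs of the objects touched by the operations added to this CCB so far.
    touched_objects: list[str] = field(default_factory=list)


def blocks_amf_attach(ccb: Ccb, amf_objects: set[str]) -> bool:
    """True if the CCB is active and holds at least one operation on an AMF object.

    An empty CCB never blocks, because any() over no operations is False.
    """
    return ccb.active and any(dn in amf_objects for dn in ccb.touched_objects)
```

An active but empty CCB, or an active CCB that only touches non-AMF objects, does not block the attach in this model; the latter is the "false" interference case from #1078.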

The first level of solution in my opinion is that both AMFDs should retry
forever (in a separate thread assumed to be the case already) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whoever is listening that there is a lingering
AMF related CCB that should be terminated (aborted or committed by the user). 
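
A minimal sketch of such an eternal retry with periodic notification, in Python rather than AMFD code. Everything here is an assumption made for illustration: `try_attach` stands in for the implementer-set/applier-set call, `notify` for whatever notification mechanism is chosen, and the string constants for SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN.

```python
import time
from typing import Callable

# Stand-ins for SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN (hypothetical values).
OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def attach_forever(
    try_attach: Callable[[], str],
    notify: Callable[[str], None],
    retry_interval: float = 0.5,
    notify_every: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry until attached; return the number of TRY_AGAIN replies seen."""
    retries = 0
    while True:
        rc = try_attach()
        if rc == OK:
            return retries
        if rc != TRY_AGAIN:
            # Any other error is not the lingering-CCB case; do not mask it.
            raise RuntimeError(f"unexpected return code: {rc}")
        retries += 1
        if retries % notify_every == 0:
            # Periodic reminder instead of giving up and restarting.
            notify(f"still blocked after {retries} retries; a lingering "
                   "AMF-related CCB should be aborted or applied")
        sleep(retry_interval)
```

The loop never escalates to a restart on TRY_AGAIN; it is meant to run in the separate thread mentioned above so that it does not block the rest of the AMFD.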

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhancements.
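
A sketch of how the active AMFD could use such an admin-operation, again as a Python model and not as OpenSAF code. The callable `request_ccb_abort` is a hypothetical stand-in for invoking the admin-op on the immsv, `try_attach` stands in for the implementer-set call, and the grace period of 20 retries is an arbitrary choice.

```python
import time
from typing import Callable, Tuple

# Stand-ins for SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN (hypothetical values).
OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def attach_with_ccb_abort(
    try_attach: Callable[[], str],
    request_ccb_abort: Callable[[], None],
    grace_retries: int = 20,
    retry_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bool]:
    """Retry the attach; after a grace period, ask once for non-critical CCBs to be aborted.

    Returns (number of TRY_AGAIN replies seen, whether the abort was requested).
    """
    retries = 0
    abort_requested = False
    while True:
        rc = try_attach()
        if rc == OK:
            return retries, abort_requested
        if rc != TRY_AGAIN:
            raise RuntimeError(f"unexpected return code: {rc}")
        retries += 1
        if retries >= grace_retries and not abort_requested:
            # Fire and forget: the reply is not needed, the next attach
            # attempt shows whether the blockage is gone.
            request_ccb_abort()
            abort_requested = True
        sleep(retry_interval)
```

Waiting for a while before requesting the abort gives a CCB that is merely slow a chance to complete on its own, so only CCBs that really linger are cleared.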

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE. 
The probability is low that there is both a critical CCB stuck and that it 
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here. 
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
can not process configuration changes on its data while in this state.
That is not a fatal condition and so should not be escalated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways. 
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync clearing all non-critical CCBs.



---

Sent from sourceforge.net because [email protected] is 
subscribed to http://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
http://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.

------------------------------------------------------------------------------

_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
