1. When an SU goes to the termination-failed state, Amfnd should inform Amfd so
that Amfd can take corrective action. All other components in the SU, except
the component in the termination-failed state, should be cleaned up.
2. Amfd may then send a QUIESCED or REMOVE assignment to Amfnd.
3. Amfnd should not process the assignment, because the SU is in the
termination-failed state; instead, it should store the SUSI message in a buffer.
4. Amfd and Amfnd will both keep their state for that particular SU suspended
as it is.
5. When a repair action is performed on the termination-failed SU (even if the
SG is unstable), Amfd should allow the admin operation.
6. Amfd should forward the admin operation to Amfnd so that Amfnd instantiates
the SU.
7. Amfnd should instantiate the SU, then process the buffered SUSI command and
send the SUSI response back to Amfd.
8. If the SU went to the termination-failed state while an assignment was in
progress, Amfd should not send the next SUSI assignment to Amfnd when Amfnd
reports the SU termination failure. Instead, when the repair admin command
reaches Amfnd, Amfnd should enable the SU, complete the in-progress SUSI
assignment and respond to Amfd.
9. Taken together, the points above stop the SG when the SU goes to the
termination-failed state; after the admin repair command is performed, the SG
resumes from the state in which it was stopped. In effect, the SG is suspended
for that SU for every operation other than the admin repair command.
10. While the SG is suspended for the termination-failed SU, faults can still
occur in other assigned or unassigned SUs, or on the nodes hosting those SUs.
The action taken on those other SUs will follow the SG FSM state.
11. The faulted SU holds the same state when it is repaired. That means the SU
is considered to be in service for all SG FSM purposes.
---
**[tickets:#538] AMF: fail-over assignments despite comps in TERM-FAILED state**
**Status:** unassigned
**Created:** Fri Aug 09, 2013 06:43 AM UTC by Hans Feldt
**Last Updated:** Wed Oct 09, 2013 06:02 PM UTC
**Owner:** nobody
AMF currently performs the fail-over recovery action even though a component is
in the termination-failed presence state. This can lead to severe
inconsistencies for the application. The specification also clearly states how
this should work, in section 4.8:
"If the component and any of its contained components (for a container
component) were assigned the active HA state for some component service
instances when the CLEANUP command was executed, and semantics of the
redundancy model of its enclosing service group guarantee that at a point in
time only one component can be in the active HA state for a given component
service instance, the failure to terminate that component prevents the
Availability Management Framework from assigning to another component the
active HA state for these component service instances (and by the same token
prevents the assignment of the active HA state to other service units for the
service instances that contain the involved CSIs). In this case, the service
instances will stay unassigned until an administrative action is performed to
terminate the failed component."
This can be tested by running the AMF 2N SA-aware sample app and modifying the
cleanup script to do "exit 1", which gives the following effect when the active
component is killed:
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'componentRestart'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Reason:'Exec of script success, but script exits with non-zero status'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Exit code: 1
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Component Failover trigerred for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1': Failed component: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATION_FAILED
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro amf_demo[11620]: CSI Set - HAState Active for all assigned CSIs
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removing 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
---
Sent from sourceforge.net because [email protected] is
subscribed to https://sourceforge.net/p/opensaf/tickets/
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets