- **summary**: opensaf shall support indefinite unavailbility of both
controllers (Hydra V1) --> Support multiple node failures without cluster
restart (Hydra V1)
- Description has changed:
Diff:
~~~~
--- old
+++ new
@@ -1,21 +1,14 @@
-The opensaf cluster shall survive that both system controllers are
indefinitely unavaliable i.e. down (no cluster reboot as today)
+The opensaf cluster shall survive simultaneous failure of multiple nodes
without initiating a cluster restart. In particular, it shall support
simultaneous failure of both controller nodes. To support long lasting and/or
permanent node failure, OpenSAF must be able to move the system controller
functionality to any node in the cluster. After the system controllers recover,
either on the same nodes as before or on some other nodes, IMM and AMF state
shall be as before the controllers got unavailable.
-After the system controllers recover, IMM and AMF state shall be as before the
controllers got unavailable.
+Since AMF state can not change while the system controllers are unavailable,
this means that AMF can not react to service availability events for as long as
the cluster is running without an active system controller. This means that
service availability (a statistical property) will be impacted in relation to
how often this new feature is excercised. Therefore, it is important that a new
system controller can be elected and come into service as quickly as possible
to minimise the time spent in this "headless" state.
-Since AMF state can not change, this means that AMF can not react to
-service availability events. This means that service availability (a
statistical
-property) will be impacted in relation to how often this new feature is
-excercised.
-
-It is important that *everyone* understands this.
-There is no magic being done here.
-
-Use case: opensaf cloud deployment. In a cloud deployment, the risk for
multiple simultaneous node failures is increased due to a number of reasons:
+The use case for this is OpenSAF deployment within a cloud. In a cloud
deployment, the risk for multiple simultaneous node failures is increased due
to a number of reasons:
* The hardware used to build cloud infrastructure may not be carrier-grade.
* The hypervisor is an extra layer which can also cause VM failures.
* Multiple VMs can be hosted on the same physical hardware. There is no
standardized interface for querying if two nodes are located on the same
physical machine.
* Live migration of VMs can cause disruptions
* The "Pets vs cattle" thinking: There is an expectation that VMs can be
treated as "cattle", i.e. that the loss of a few VMs shall not have a
devastating effect on the whole cluster (which can consist of a hundred nodes).
+* Consolidation of IT and telecom systems.
To be refined a lot...
~~~~
---
**[tickets:#1132] Support multiple node failures without cluster restart (Hydra V1)**
**Status:** unassigned
**Milestone:** 4.6.FC
**Created:** Tue Sep 23, 2014 01:51 PM UTC by Hans Feldt
**Last Updated:** Tue Dec 02, 2014 01:54 PM UTC
**Owner:** nobody
The OpenSAF cluster shall survive the simultaneous failure of multiple nodes
without initiating a cluster restart. In particular, it shall support the
simultaneous failure of both controller nodes. To support long-lasting and/or
permanent node failure, OpenSAF must be able to move the system controller
functionality to any node in the cluster. After the system controllers recover,
either on the same nodes as before or on some other nodes, IMM and AMF state
shall be the same as before the controllers became unavailable.
Since AMF state cannot change while the system controllers are unavailable,
AMF cannot react to service availability events for as long as
the cluster is running without an active system controller. Service
availability (a statistical property) will therefore suffer in relation to
how often this new feature is exercised. It is thus important that a new
system controller can be elected and come into service as quickly as possible
to minimise the time spent in this "headless" state.
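
As a rough illustration of the point above, the share of time during which AMF
cannot react can be estimated from how often the cluster goes headless and how
long each headless period lasts. This is only a sketch under a simple assumed
model (headless periods do not overlap and the impact scales linearly); the
function name and all figures are hypothetical and are not OpenSAF measurements
or APIs.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def headless_fraction(events_per_year: float, mean_headless_seconds: float) -> float:
    """Estimate the fraction of a year spent without an active system controller.

    Assumes headless periods never overlap, so the total headless time is
    simply the event count multiplied by the mean duration. The result is
    clamped to 1.0 because a fraction of time cannot exceed 100%.
    """
    if events_per_year < 0 or mean_headless_seconds < 0:
        raise ValueError("inputs must be non-negative")
    fraction = events_per_year * mean_headless_seconds / SECONDS_PER_YEAR
    return min(fraction, 1.0)


if __name__ == "__main__":
    # Hypothetical scenario: four double-controller failures per year,
    # compared across three assumed controller election times.
    for election_seconds in (10, 60, 600):
        share = headless_fraction(4, election_seconds)
        print(f"election time {election_seconds:>4} s -> headless share {share:.2e}")
```

Under this model, halving the time needed to elect a new system controller
halves the headless share, which is why the election time matters as much as
the failure frequency.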
The use case for this is OpenSAF deployment within a cloud. In a cloud
deployment, the risk of multiple simultaneous node failures is increased for
a number of reasons:
* The hardware used to build cloud infrastructure may not be carrier-grade.
* The hypervisor is an extra layer which can also cause VM failures.
* Multiple VMs can be hosted on the same physical hardware. There is no
standardized interface for querying if two nodes are located on the same
physical machine.
* Live migration of VMs can cause disruptions.
* The "Pets vs cattle" thinking: There is an expectation that VMs can be
treated as "cattle", i.e. that the loss of a few VMs shall not have a
devastating effect on the whole cluster (which can consist of a hundred nodes).
* Consolidation of IT and telecom systems.
To be refined a lot...