Darren,

In a peer-to-peer model such as the one I describe, active-active both is and 
is not a concept.  The supervision tree is responsible for identifying failure 
and initiating process re-allocation for failed resources.  For example, if a 
pod’s management process crashed, it would also crash all of the processes 
managing the hosts in that pod.  The zone would then attempt to restart the 
pod’s management process (either local to the zone supervisor or on a remote 
instance, which could be configurable) until it was able to start a “ready” 
process for the child resource.
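
To make that restart behavior concrete, here is a minimal plain-Java sketch of 
a zone supervisor re-allocating a crashed pod management process until a ready 
process exists.  The names (ZoneSupervisor, PodProcess, start()) are purely 
illustrative assumptions, not existing CloudStack or library APIs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; not a CloudStack API. */
public class ZoneSupervisor {

    /** Stand-in for the management process of a single pod. */
    static class PodProcess {
        final String podId;
        PodProcess(String podId) { this.podId = podId; }
        /** Synchronize with the underlying hosts; true once "ready". */
        boolean start() { return true; }
    }

    // One management process per pod; crashing it also crashes the
    // host-level processes it supervises.
    private final Map<String, PodProcess> pods = new ConcurrentHashMap<>();

    // Invoked by the supervision tree when a pod management process crashes.
    public void onPodCrashed(String podId) {
        pods.remove(podId);
        restartUntilReady(podId);
    }

    // Restart (locally or on a remote instance, depending on configuration)
    // until a ready process exists for the child resource.
    private void restartUntilReady(String podId) {
        PodProcess pod = new PodProcess(podId);
        while (!pod.start()) {
            pod = new PodProcess(podId);  // re-allocate and retry
        }
        pods.put(podId, pod);
    }
}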

This model requires a “special” root supervisor, controlled by the 
orchestration tier, that can identify when a zone supervisor becomes 
unavailable and attempt to restart it.  Ownership of this “special” supervisor 
will require a consensus mechanism amongst the orchestration tier processes to 
elect an owner of the process and to determine when a new owner needs to be 
elected (e.g. a Raft implementation such as barge [1]).  Given that the 
orchestration tier is designed as an AP system, any orchestration tier process 
should be able to become the owner (i.e. the operator is not required to 
identify a “master” node).  There are likely other potential topologies (e.g. 
a root supervisor per zone rather than one for all zones), but in all cases 
ownership election would work the same way.  Most importantly, there are no 
data durability requirements in this claim model.  When an orchestration 
process becomes unable to continue owning a root supervisor, the other 
orchestration processes recognize the missing owner and initiate the 
ownership-claim process for the partition.
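
A rough sketch of the ownership claim follows, assuming a hypothetical 
ConsensusModule interface that stands in for a Raft implementation such as 
barge (it is not barge’s actual API).  Note that no durable data is involved; 
leadership alone determines which orchestration process runs the root 
supervisor.

import java.util.concurrent.atomic.AtomicReference;

/** Illustrative sketch only; names do not correspond to real APIs. */
public class RootSupervisorOwner {

    /** Hypothetical stand-in for a Raft library such as barge. */
    interface ConsensusModule {
        boolean isLeader();                          // is this node the elected owner?
        void onLeadershipChange(Runnable callback);  // fired on every (re-)election
    }

    /** The "special" root supervisor controlled by the orchestration tier. */
    static class RootSupervisor {
        void start() { /* rebuild zone supervisors from the failure point down */ }
        void stop()  { /* hand off ownership cleanly */ }
    }

    private final AtomicReference<RootSupervisor> owned = new AtomicReference<>();

    public RootSupervisorOwner(ConsensusModule consensus) {
        consensus.onLeadershipChange(() -> {
            if (consensus.isLeader()) {
                RootSupervisor supervisor = new RootSupervisor();
                supervisor.start();               // this node now owns the root
                owned.set(supervisor);
            } else {
                RootSupervisor previous = owned.getAndSet(null);
                if (previous != null) {
                    previous.stop();              // another owner was elected
                }
            }
        });
    }
}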

In all failure scenarios, the supervision tree must be rebuilt from the point 
of failure downward using the process allocation process I previously 
described.  For an initial implementation, I would recommend simply throwing 
away any parts of the supervision tree that are already running in the event 
of a widespread failure (e.g. a zone with many pods).  Depending on recovery 
time and SLAs, a future optimization may be to re-attach “orphaned” branches 
of the previous tree to the tree being built as part of the recovery process 
(e.g. loss of a zone supervisor due to a switch failure).  Additionally, the 
system would need a mechanism to hand off ownership of the root supervisor for 
planned outages (hardware upgrades/decommissioning, maintenance windows, etc).
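
A minimal sketch of the “throw away and rebuild downward” strategy, with 
illustrative names (ResourceNode, allocateProcess) that do not correspond to 
existing CloudStack classes:

import java.util.ArrayList;
import java.util.List;

public class TreeRecovery {

    /** A node in the supervision tree (zone, pod, host, resource, ...). */
    static class ResourceNode {
        final String id;
        final List<ResourceNode> children = new ArrayList<>();
        ResourceNode(String id) { this.id = id; }
    }

    /** Allocate a fresh process for the resource (returns once it is ready). */
    void allocateProcess(ResourceNode node) { /* allocation per the model above */ }

    // Initial implementation: discard any surviving branches beneath the
    // failure point and rebuild everything from scratch.
    void rebuildFrom(ResourceNode failedNode) {
        allocateProcess(failedNode);
        for (ResourceNode child : failedNode.children) {
            rebuildFrom(child);
        }
    }
}

The later optimization would replace the unconditional allocateProcess call 
with a check for a surviving “orphaned” branch that can simply be re-attached.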

Again, caveated with a few hand waves, the idea is to build up a peer-to-peer 
management model that provides strict serialization guarantees.  Fundamentally, 
it utilizes a tree of processes to provide exclusive access, distribute work, 
and ensure availability requirements when partitions occur.  Details would need 
to be worked out for the best application to CloudStack (e.g. root node 
ownership and orchestration tier gossip), but we would be implementing 
well-trod distributed systems concepts in the context of cloud orchestration 
(sounds like a fun thing to do …).

Thanks,
-John

[1]: https://github.com/mgodave/barge

P.S. I see the libraries/frameworks referenced as the building blocks of a 
solution, but none of them (in whole or in combination) solves the problem 
completely.

On Nov 25, 2013, at 12:29 PM, Darren Shepherd <darren.s.sheph...@gmail.com> 
wrote:

> I will ask one basic question.  How do you foresee managing one mailbox per 
> resource?  If I have multiple servers running in an active-active mode, how 
> do you determine which server has the mailbox?  Do you create actors on 
> demand?  How do you synchronize that operation?
> 
> Darren
> 
> 
> On Mon, Nov 25, 2013 at 10:16 AM, Darren Shepherd 
> <darren.s.sheph...@gmail.com> wrote:
> You bring up some interesting points.  I really need to digest this further.  
> From a high level I think I agree, but there are a lot of implied details of 
> what you've said.
> 
> Darren
> 
> 
> On Mon, Nov 25, 2013 at 8:39 AM, John Burwell <jburw...@basho.com> wrote:
> Darren,
> 
> I originally presented my thoughts on this subject at CCC13 [1].  
> Fundamentally, I see CloudStack as having two distinct tiers — orchestration 
> management and automation control.  The orchestration tier coordinates the 
> automation control layer to fulfill user goals (e.g. create a VM instance, 
> alter a network route, snapshot a volume, etc) constrained by policies 
> defined by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc).  
> This layer must always be available to take new requests, and to report the 
> best available infrastructure state information.  Since execution of work is 
> guaranteed on completion of a request, this layer may pend work to be 
> completed when the appropriate devices become available.
> 
> The automation control tier translates logical units of work to underlying 
> infrastructure component APIs.  Upon completion of a unit of work’s 
> execution, the state of a device (e.g. hypervisor, storage device, network 
> switch, router, etc) matches the state managed by the orchestration tier at 
> the time the unit of work was created.  In order to ensure that the state of 
> the 
> underlying devices remains consistent, these units of work must be executed 
> serially.  Permitting concurrent changes to resources creates race conditions 
> that lead to resource overcommitment and state divergence.  One symptom of 
> this phenomenon is the myriad of scripts operators write to “synchronize” 
> state between the CloudStack database and their hypervisors.  Another is the 
> rapid create-destroy example provided below, which can (and often does) leave 
> dangling resources due to race conditions between the two operations.  
> 
> In order to provide reliability, CloudStack vertically partitions the 
> infrastructure into zones (independent power source/network uplink 
> combination) sub-divided into pods (racks).  At this time, regions are 
> largely notional and, as such, are not partitions.  Between the user’s zone 
> selection and our allocators’ distribution of resources across pods, the 
> system attempts to distribute resources as widely as possible across these 
> partitions to provide resilience against a variety of infrastructure failures 
> (e.g. power loss, network uplink disruption, switch failures, etc).  In order 
> to maximize this resilience, the control plane (orchestration + automation 
> tiers) must be able to operate on all available partitions.  For example, if 
> we have two (2) zones (A & B) and twenty (20) pods per zone, we should be 
> able to take and execute work in Zone A when one or more of its pods is lost, 
> as well as when Zone B has failed entirely.
> 
> CloudStack is an eventually consistent system in that the state reflected in 
> the orchestration tier will (optimistically) differ from the state of the 
> underlying infrastructure (managed by the automation tier).  Furthermore, the 
> system has a partitioning model to provide resilience in the face of a 
> variety of logical and physical failures.  However, the automation control 
> tier requires strictly consistent operations.  Based on these definitions, 
> the system appears to violate the CAP theorem [2] (Brewer!).  The separation 
> of the system into two distinct tiers isolates these characteristics, but the 
> boundary between them must be carefully implemented to ensure that the 
> consistency requirements of the automation tier are not leaked to the 
> orchestration tier.
> 
> To properly implement this boundary, I think we should split the 
> orchestration and automation control tiers into separate physical processes 
> communicating via an RPC mechanism — allowing the automation control tier to 
> completely encapsulate its work distribution model.  In my mind, the tricky 
> wicket is providing serialization and partition tolerance in the automation 
> control tier.  Realistically, there are two options: explicit and implicit 
> locking models.  Explicit locking models employ an external coordination 
> mechanism to provide exclusive access to resources (e.g. RDBMS lock pattern, 
> ZooKeeper, Hazelcast, etc).  The challenge with this model is ensuring the 
> availability of the locking mechanism in the face of a partition, forcing 
> CloudStack operators to ensure that they have deployed the underlying 
> mechanism in a partition-tolerant manner (e.g. don’t locate all of the 
> replicas in the same pod, deploy a cluster per zone, etc).  Additionally, the 
> durability introduced by these mechanisms inhibits self-healing due to lock 
> staleness.
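> 
> To illustrate the explicit model (and the staleness problem), here is a small 
> hypothetical sketch; DistributedLock stands in for the external coordinator 
> and is not the real ZooKeeper or Hazelcast API:
> 
> public class ExplicitLockExecutor {
> 
>     /** Hypothetical external coordinator (RDBMS lock table, ZooKeeper, ...). */
>     interface DistributedLock {
>         boolean tryAcquire(String resourceId);  // may be unavailable during a partition
>         void release(String resourceId);
>     }
> 
>     private final DistributedLock locks;
> 
>     public ExplicitLockExecutor(DistributedLock locks) { this.locks = locks; }
> 
>     public void execute(String resourceId, Runnable unitOfWork) {
>         if (!locks.tryAcquire(resourceId)) {
>             return;  // the coordinator itself must be deployed partition-tolerantly
>         }
>         try {
>             unitOfWork.run();
>         } finally {
>             locks.release(resourceId);  // never runs if this JVM dies: stale lock
>         }
>     }
> }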
> 
> In contrast, an implicit lock model structures the runtime execution model to 
> provide exclusive access to a resource and model the partitioning scheme.  
> One such model is to provide a single work queue (mailbox) and consuming 
> process (actor) per resource.  The orchestration tier provides a description 
> of the partition and resource definitions to the automation control tier.  
> The automation control tier creates a supervisor per partition which in turn 
> manages process creation per resource.  Therefore, process creation and 
> destruction constitutes an implicit lock.  Since the automation control tier 
> does not persist data in this model, the crash of a supervisor and/or process 
> (supervisors are simply specialized processes) releases the implicit lock and 
> signals a re-execution of the supervisor/process allocation process.  The 
> following high-level process describes creation and allocation, with a sketch 
> after the list (hand-waving certain details such as back pressure and 
> throttling):
> 
> 1. The automation control layer receives a resource definition (e.g. zone 
>    description, VM definition, volume information, etc).  These requests are 
>    processed by the owning partition supervisor exclusively in order of 
>    receipt.  Therefore, the automation control tier views the world as a tree 
>    of partitions and resources.
> 2. The partition supervisor creates the process (and the associated mailbox), 
>    providing it with the initial state.  The process state is Initialized.
> 3. The process synchronizes the state of the underlying resource with the 
>    state provided.  Upon successful completion of state synchronization, the 
>    state of the process becomes Ready.  Only Ready processes can consume 
>    units of work from their mailboxes.  Otherwise, the process crashes.  All 
>    state transitions and crashes are reported to interested parties through 
>    an asynchronous event reporting mechanism including the id of the unit of 
>    work the device represents.
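> 
> As promised above, here is a minimal plain-Java sketch of these steps (one 
> mailbox and one consuming process per resource, Initialized/Ready/Crashed 
> states).  A real implementation would sit on a lightweight-threading library 
> rather than raw JVM threads, and every name below is illustrative:
> 
> import java.util.Map;
> import java.util.concurrent.*;
> 
> public class PartitionSupervisor {
> 
>     enum State { INITIALIZED, READY, CRASHED }
> 
>     static class ResourceProcess implements Runnable {
>         final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
>         volatile State state = State.INITIALIZED;
> 
>         public void run() {
>             try {
>                 synchronizeDeviceState();  // step 3: converge the device first
>                 state = State.READY;       // only Ready processes consume work
>                 while (state == State.READY) {
>                     mailbox.take().run();  // units of work execute serially
>                 }
>             } catch (Exception crash) {
>                 state = State.CRASHED;     // the mailbox dies with the process
>             }
>         }
> 
>         void synchronizeDeviceState() { /* device-specific synchronization */ }
>     }
> 
>     // Implicit lock: exactly one process (and mailbox) per resource id.
>     private final Map<String, ResourceProcess> processes = new ConcurrentHashMap<>();
>     private final ExecutorService threads = Executors.newCachedThreadPool();
> 
>     // Steps 1 and 2: definitions are handled in order of receipt, and a
>     // process plus mailbox is created for each resource.
>     public ResourceProcess allocate(String resourceId) {
>         return processes.computeIfAbsent(resourceId, id -> {
>             ResourceProcess process = new ResourceProcess();
>             threads.submit(process);
>             return process;
>         });
>     }
> }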
> 
> The Ready state means that the underlying device is in a useable state 
> consistent with the last unit of work executed.  A process crashes when it is 
> unable to bring the device into a state consistent with the unit of work 
> being executed (a process crash also destroys the associated mailbox — 
> flushing pending work).  This event initiates execution of allocation process 
> (above) until the process can be re-allocated in a Ready state (again 
> throttling is hand waved for the purposes of brevity).  The state 
> synchronization step converges the actual state of the device with changes 
> that occurred during unavailability.  When a unit of work fails to be 
> executed, the orchestration tier determines the appropriate recovery strategy 
> (e.g. re-allocate work to another resource, wait for the availability of the 
> resource, fail the operation, etc).
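> 
> The recovery decision itself can be sketched as nothing more than a policy 
> the orchestration tier consults when the failure event (carrying the unit of 
> work id) arrives; the names below are illustrative:
> 
> enum RecoveryStrategy { REALLOCATE_TO_ANOTHER_RESOURCE, WAIT_FOR_RESOURCE, FAIL_OPERATION }
> 
> class RecoveryPolicy {
>     // A stand-in for whatever the operator's SLAs actually require.
>     RecoveryStrategy onWorkFailed(String unitOfWorkId, boolean alternativeResourceExists) {
>         return alternativeResourceExists
>                 ? RecoveryStrategy.REALLOCATE_TO_ANOTHER_RESOURCE
>                 : RecoveryStrategy.WAIT_FOR_RESOURCE;
>     }
> }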
> 
> The association of one process per resource provides exclusive access to the 
> resource without the requirement of an external locking mechanism.  A mailbox 
> per process orders pending units of work.  Together, they provide 
> serialization of operation execution.  In the example provided, a unit of 
> work would be submitted to create a VM and a second unit of work would be 
> submitted to destroy it.  The creation would be completely executed, followed 
> by the destruction (assuming no failures).  Therefore, the VM will briefly 
> exist before being destroyed.  In conjunction with a process location 
> mechanism, the system can place the processes associated with resources in 
> the appropriate partition, allowing the system to properly self-heal, manage 
> its own scalability (thinking lightweight system VMs), and systematically 
> enforce partition tolerance (the operator was nice enough to describe their 
> infrastructure; we should use it to ensure the resilience of CloudStack and 
> their infrastructure).
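> 
> Using the PartitionSupervisor sketch above (and assuming the classes share a 
> package), the create-destroy example reduces to enqueuing two units of work 
> on the same mailbox:
> 
> public class CreateDestroyExample {
>     public static void main(String[] args) {
>         PartitionSupervisor supervisor = new PartitionSupervisor();
>         PartitionSupervisor.ResourceProcess vm = supervisor.allocate("vm-1");
> 
>         vm.mailbox.add(() -> System.out.println("create VM 1"));   // executes first
>         vm.mailbox.add(() -> System.out.println("destroy VM 1"));  // executes second
>         // (the consuming process is left running for brevity)
>     }
> }
> 
> Because a single process drains the mailbox, the destroy cannot begin until 
> the create has completed, so the VM briefly exists and is then cleanly 
> removed.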
> 
> Until relatively recently, the implicit locking model described was 
> infeasible on the JVM.  Using native Java threads, a server would be limited 
> to controlling (at best) a few hundred resources.  However, lightweight 
> threading models implemented by libraries/frameworks such as Akka [3], Quasar 
> [4], and Erjang [5] can scale to millions of “threads” on reasonably sized 
> servers and provide the supervisor/actor/mailbox abstractions described 
> above.  Most importantly, this approach does not require operators to become 
> operationally knowledgeable of yet another platform/component.  In short, I 
> believe we can encapsulate these requirements in the management server 
> (orchestration + automation control tiers) — keeping the operational 
> footprint of the system proportional to the deployment without sacrificing 
> resilience.  Finally, it provides the foundation for proper collection of 
> instrumentation information and process control/monitoring across data 
> centers.
> 
> Admittedly, I have hand-waved some significant issues that would need to be 
> resolved.  I believe they are all resolvable, but it would take discussion to 
> determine the best approach to them.  Transforming CloudStack to such a model 
> would not be trivial, but I believe it would be worth the (significant) 
> effort as it would make CloudStack one of the most scalable and resilient 
> cloud orchestration/management platforms available.
> 
> Thanks,
> -John
> 
> [1]: 
> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
> [2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
> [3]: http://akka.io
> [4]: https://github.com/puniverse/quasar
> [5]: https://github.com/trifork/erjang/wiki
> 
> P.S.  I have CC’ed the developer mailing list.  All conversations at this 
> level of detail should be initiated and occur on the mailing list to ensure 
> transparency with the community.
> 
> On Nov 22, 2013, at 3:49 PM, Darren Shepherd <darren.s.sheph...@gmail.com> 
> wrote:
> 
>> 140 characters are not productive.  
>> 
>> What would be your ideal way to do distributed concurrency control?  Simple 
>> use case.  Server 1 receives a request to start a VM 1.  Server 2 receives a 
>> request to delete VM 1.  What do you do?
>> 
>> Darren
> 
> 
> 
