Clint Byrum wrote on 05/29/2014 07:52:07 PM:

> I am writing to get some brainstorming started on how we might mitigate
> some of the issues we've seen while deploying large stacks on Heat. I am
> sending this to the dev list because it may involve landing fixes rather
> than just using different strategies. The problems outlined here are
> well known and reported as bugs or feature requests, but there may be
> more that we can do.
> ...
> Strategies:
> ...
> update-failure-recovery
> =======================
> This is a blueprint I believe Zane is working on to land in Juno. It
> would allow us to retry a failed create or update action. Combined with the
> separate controller/compute node strategy, this may be our best option,
> but it is unclear whether that code will be available soon or not. The
> chunking is definitely required, because with 500 compute nodes, if
> node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
> cancelled, which makes the impact of a transient failure quite extreme.
> Also without chunking, we'll suffer from some of the performance
> problems we've seen where a single engine process will have to do all of
> the work to bring up a stack.
> Pros: * Uses blessed strategy
> Cons: * Implementation is not complete
>       * Still suffers from heavy impact of failure
>       * Requires chunking to be feasible

I like this one.  As I remarked in the convergence discussion, I think the 
first step there is a DB schema change to separate desired and observed 
state.  Once that is done, failure on one resource need not wedge a stack; 
non-dependent resources (like the peer compute nodes) can still be
created or updated.

This does not address the issue of putting a lot of work in one process; 
that requires a more radical change.
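
To make the desired/observed split concrete, here is a rough Python sketch.
It is not Heat's schema or engine code: the Resource class, the state names,
and the converge() and retry() helpers are all hypothetical, and create is a
stand-in callable that returns True on success. The point is only that a
failure is recorded against the one resource that failed, so its peers still
complete and a later retry touches just the failed resource.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    depends_on: list = field(default_factory=list)
    desired: str = "COMPLETE"  # what the template asks for
    observed: str = "INIT"     # what was last seen in the cloud


def converge(resources, create):
    """Move observed state toward desired state.

    A failed resource is marked FAILED and skipped; resources that do not
    depend on it keep going. Returns the names still not in desired state.
    """
    by_name = {r.name: r for r in resources}
    progressed = True
    while progressed:
        progressed = False
        for r in resources:
            if r.observed == r.desired or r.observed == "FAILED":
                continue
            # Only act once every dependency has reached COMPLETE.
            if all(by_name[d].observed == "COMPLETE" for d in r.depends_on):
                r.observed = "COMPLETE" if create(r) else "FAILED"
                progressed = True
    return [r.name for r in resources if r.observed != r.desired]


def retry(resources, create):
    """Reset only the FAILED resources and converge again."""
    for r in resources:
        if r.observed == "FAILED":
            r.observed = "INIT"
    return converge(resources, create)
```

With one network and five compute nodes that depend on it, a transient
failure on one node leaves only that node outstanding, and retry() redoes
only that node. Everything still runs in a single process here, so this
says nothing about the work-distribution problem.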


OpenStack-dev mailing list
