Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> On 11/08/14 10:46, Clint Byrum wrote:
> > Right now we're stuck with an update that just doesn't work. It isn't
> > just about update-failure-recovery, which is coming along nicely, but
> > it is also about the lack of signals to control rebuild, poor support
> > for addressing machines as groups, and unacceptable performance in
> > large stacks.
> Are there blueprints/bugs filed for all of these issues?

Convergence addresses the poor performance for large stacks in general.
We also have this:


Which shows how slow metadata access can get. I have worked on patches
but haven't been able to complete them. We made big strides, but we are
at a point where 40 nodes polling Heat every 30s is too much for one CPU
to handle. When we scaled Heat out onto more CPUs on one box by forking,
we ran into eventlet issues. We also ran into issues because, even with
many processes, we can only use one to resolve templates for a single
stack during update, which was also excessively slow.

We haven't been able to come back around to those yet, but you can see
where this has turned into a bit of a rat hole of optimization.

action-aware-sw-config is sort of what we want for rebuild. We
collaborated with the Trove devs a while back on how to also address it
for resize, but I have lost track of that work as it has taken a back
seat to more pressing issues.

Addressing groups is a general problem that I've had a hard time
articulating in the past. Tomas Sedovic has done a good job with this
TripleO spec, but I don't know that we've asked for an explicit change
in a bug or spec in Heat just yet:


There are a number of other issues noted in that spec which are already
addressed in Heat, but require refactoring in TripleO's templates and
tools, and that work continues.

The point remains: we need something that works now, and doing an
alternate implementation for updates is actually faster than addressing
all of these issues.

