Clint Byrum <cl...@fewbar.com> wrote on 05/29/2014 07:52:07 PM:

> I am writing to get some brainstorming started on how we might mitigate
> some of the issues we've seen while deploying large stacks on Heat. I am
> sending this to the dev list because it may involve landing fixes rather
> than just using different strategies. The problems outlined here are
> well known and reported as bugs or feature requests, but there may be
> more that we can do.
>
> ...
>
> Strategies:
>
> ...
>
> update-failure-recovery
> =======================
>
> This is a blueprint I believe Zane is working on to land in Juno. It will
> allow us to retry a failed create or update action. Combined with the
> separate controller/compute node strategy, this may be our best option,
> but it is unclear whether that code will be available soon or not. The
> chunking is definitely required, because with 500 compute nodes, if
> node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
> cancelled, which makes the impact of a transient failure quite extreme.
> Also without chunking, we'll suffer from some of the performance
> problems we've seen where a single engine process will have to do all of
> the work to bring up a stack.
>
> Pros: * Uses blessed strategy
>
> Cons: * Implementation is not complete
>       * Still suffers from heavy impact of failure
>       * Requires chunking to be feasible
I like this one. As I remarked in the convergence discussion, I think the
first step there is a DB schema change to separate desired and observed
state. Once that is done, failure on one resource need not wedge a stack;
non-dependent resources (like the peer compute nodes) can still be created.

This does not address the issue of putting a lot of work in one process;
that requires a more radical change.

Regards,
Mike
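
P.S. To make the desired/observed split concrete, here is a minimal sketch
in Python. It is not Heat code and not a proposed schema: `Resource`,
`converge`, `retry_failed`, and the state strings are all hypothetical names
chosen for illustration. The list of resources stands in for the desired
state, and a separate dict holds the observed state, so one failed resource
only blocks the resources that depend on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Desired state for one resource (hypothetical, not Heat's model)."""
    name: str
    depends_on: tuple = ()


def converge(resources, observed, create):
    """Create every resource whose dependencies are observed COMPLETE.

    `observed` maps resource name -> "COMPLETE" or "FAILED" and is kept
    separate from the desired `resources` list. A failure is recorded and
    the loop moves on, so independent peers are still created.
    """
    progressed = True
    while progressed:
        progressed = False
        for res in resources:
            if observed.get(res.name) in ("COMPLETE", "FAILED"):
                continue  # already attempted in this round
            if all(observed.get(dep) == "COMPLETE" for dep in res.depends_on):
                try:
                    create(res)
                    observed[res.name] = "COMPLETE"
                except Exception:
                    observed[res.name] = "FAILED"
                progressed = True
    return observed


def retry_failed(resources, observed, create):
    """Forget FAILED observations and converge again; COMPLETE ones are kept."""
    for name, state in list(observed.items()):
        if state == "FAILED":
            del observed[name]
    return converge(resources, observed, create)


# Example: one controller, five compute nodes, one transient failure.
desired = [Resource("controller")] + [
    Resource(f"compute{i}", depends_on=("controller",)) for i in range(5)
]
transient_failures = {"compute2"}


def flaky_create(res):
    # Fails once for compute2, then succeeds on the next attempt.
    if res.name in transient_failures:
        transient_failures.discard(res.name)
        raise RuntimeError(f"transient failure creating {res.name}")


observed = converge(desired, {}, flaky_create)
after_first_pass = dict(observed)
observed = retry_failed(desired, observed, flaky_create)
```

In this sketch the first pass leaves only compute2 marked FAILED while its
peers complete, and the retry touches only compute2. Everything still runs
in one process, so it does nothing for the single-engine workload problem.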
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev