[openstack-dev] [Heat] Convergence proof-of-concept showdown

Zane Bitter Wed, 26 Nov 2014 18:23:48 -0800

A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective.


Zane's Proposal
===============

https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph

I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues).

This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.


Some of the design goals which were successfully incorporated:

- Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database

- Work with the existing plugin API

- Limit total DB access for Resource/Stack to O(n) in the number of resources

- Limit overall DB access to O(m) in the number of edges

- Limit lock contention to only those operations actually contending (i.e. no global locks)

- Each worker task deals with only one resource
- Only read resource attributes once

Open questions:

- What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died?



Michał's Proposal
=================

https://github.com/inc0/heat-convergence-prototype/tree/iterative

Note that a version modified by me to use the same test scenario format (but not the same scenarios) is here:


https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted

This is based on my simulation framework after a fashion, but with everything implemented synchronously and a lot of handwaving about how the actual implementation could be distributed. The central premise is that at each step of the algorithm, the entire graph is examined for tasks that can be performed next, and those are then started. Once all are complete (it's synchronous, remember), the next step is run. Keen observers will be asking how we know when it is time to run the next step in a distributed version of this algorithm, where it will be run and what to do about resources that are in an intermediate state at that time. All of these questions remain unanswered.
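To make the "examine the entire graph at each step" premise concrete, here is a minimal synchronous sketch. It is not taken from either prototype repository; the function name `converge_iteratively`, the dict-of-sets graph format and the example resource names are all assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def converge_iteratively(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Return the batch of resources started at each step.

    `deps` maps each (hypothetical) resource name to the names it requires.
    """
    done: Set[str] = set()
    steps: List[List[str]] = []
    while len(done) < len(deps):
        # Each step re-examines the *entire* graph for tasks that can run.
        ready = sorted(
            name for name, reqs in deps.items()
            if name not in done and reqs <= done
        )
        if not ready:
            raise ValueError("dependency cycle or unknown requirement")
        # Synchronous simulation: the whole batch completes before the
        # next step begins.
        done.update(ready)
        steps.append(ready)
    return steps


# Hypothetical example graph, not one of the prototype's scenarios.
graph = {
    "net": set(),
    "volume": set(),
    "port": {"net"},
    "server": {"port", "volume"},
}
print(converge_iteratively(graph))
# prints [['net', 'volume'], ['port'], ['server']]
```

Note that every step scans all n resources, so a long dependency chain costs on the order of n scans of n rows each, which is one way to read the database concern listed below.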


A non-exhaustive list of concerns I have:
- Replace on update is not implemented yet
- AFAIK rollback is not implemented yet
- The simulation doesn't actually implement the proposed architecture
- This approach is punishingly heavy on the database - O(n^2) or worse

- A lot of phase 2 is mixed in with phase 1 here, making it difficult to evaluate which changes need to be made first and whether this approach works with existing plugins
- The code is not really based on how Heat works at the moment, so there would be either a major redesign required or lots of radical changes in Heat or both

I think there's a fair chance that given another 3-4 weeks to work on this, all of these issues and others could probably be resolved. The question for me at this point is not so much "if" but "why".

Michał believes that this approach will make Phase 2 easier to implement, which is a valid reason to consider it. However, I'm not aware of any particular issues that my approach would cause in implementing phase 2 (note that I have barely looked into it at all though). In fact, I very much want Phase 2 to be entirely encapsulated by the Resource class, so that the plugin type (legacy vs. convergence-enabled) is transparent to the rest of the system. Only in this way can we be sure that we'll be able to maintain support for legacy plugins. So a phase 1 that mixes in aspects of phase 2 is actually a bad thing in my view.

I really appreciate the effort that has gone into this already, but in the absence of specific problems with building phase 2 on top of another approach that are solved by this one, I'm ready to call this a distraction.



Anant & Friends' Proposal
=========================

First off, I have found this very difficult to review properly since the code is not separate from the huge mass of Heat code and nor is the commit history in the form that patch submissions would take (but rather includes backtracking and iteration on the design). As a result, most of the information here has been gleaned from discussions about the code rather than direct review. I have repeatedly suggested that this proof of concept work should be done using the simulator framework instead, unfortunately so far to no avail.

The last we heard on the mailing list about this, resource clean-up had not yet been implemented. That was a major concern because that is the more difficult half of the algorithm. Since then there have been a lot more commits, but it's not yet clear whether resource clean-up, update-during-update, replace-on-update and rollback have been implemented, though it is clear that at least some progress has been made on most or all of them. Perhaps someone can give us an update.

AIUI this code also mixes phase 2 with phase 1, which is a concern. For me the highest priority for phase 1 is to be sure that it works with existing plugins. Not only because we need to continue to support them, but because converting all of our existing 'integration-y' unit tests to functional tests that operate in a distributed system is virtually impossible in the time frame we have available. So the existing test code needs to stick around, and the existing stack create/update/delete mechanisms need to remain in place until such time as we have equivalent functional test coverage to begin eliminating existing unit tests. (We'll also, of course, need to have unit tests for the individual elements of the new distributed workflow, functional tests to confirm that the distributed workflow works in principle as a whole - the scenarios from the simulator can help with _part_ of this - and, not least, an algorithm that is as similar as possible to the current one so that our existing tests remain at least somewhat representative and don't require too many major changes themselves.)

Speaking of tests, I gathered that this branch included tests, but I don't know to what extent there are automated end-to-end functional tests of the algorithm?

From what I can gather, the approach seems broadly similar to the one I eventually settled on also. The major difference appears to be in how we merge two or more streams of execution (i.e. when one resource depends on two or more others). In my approach, the dependencies are stored in the resources and each joining of streams creates a database row to track it, which is easily locked with contention on the lock extending only to those resources which are direct dependencies of the one waiting. In this approach, both the dependencies and the progress through the graph are stored in a database table, necessitating (a) reading of the entire table (as it relates to the current stack) on every resource operation, and (b) locking of the entire table (which is hard) when marking a resource operation complete.
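The per-join row idea can be sketched roughly as follows. This is an illustrative toy, not code from the prototype: the `SyncPoint` class, the `run` function and the in-memory lock standing in for a database row lock are all assumptions, and like the simulator it processes one operation at a time.

```python
import threading
from typing import Dict, List, Set


class SyncPoint:
    """Stand-in for one database row tracking a join of execution streams."""

    def __init__(self, requires: Set[str]) -> None:
        self.waiting = set(requires)
        # Stands in for a row-level lock: only the direct dependencies of
        # this one resource ever contend on it.
        self.lock = threading.Lock()

    def satisfy(self, dep: str) -> bool:
        """Record that `dep` is complete; True when nothing is left to wait for."""
        with self.lock:
            if dep not in self.waiting:
                return False  # duplicate or unexpected notification
            self.waiting.discard(dep)
            return not self.waiting


def run(deps: Dict[str, Set[str]]) -> List[str]:
    """Process resources one at a time, triggering dependents as joins complete."""
    sync = {name: SyncPoint(reqs) for name, reqs in deps.items()}
    required_by: Dict[str, List[str]] = {name: [] for name in deps}
    for name, reqs in deps.items():
        for req in reqs:
            required_by[req].append(name)

    order: List[str] = []
    queue = sorted(name for name, reqs in deps.items() if not reqs)
    while queue:
        name = queue.pop(0)
        order.append(name)  # the resource operation would happen here
        # Each completion touches only the rows of its direct dependents.
        for dependent in sorted(required_by[name]):
            if sync[dependent].satisfy(name):
                queue.append(dependent)
    return order


# Hypothetical example graph, not one of the prototype's scenarios.
graph = {
    "net": set(),
    "volume": set(),
    "port": {"net"},
    "server": {"port", "volume"},
}
print(run(graph))
# prints ['net', 'volume', 'port', 'server']
```

In this sketch each completion touches one row per outgoing edge and nothing else, which is the property that keeps lock contention local rather than table-wide.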

I chatted to Anant about this today and he mentioned that they had solved the locking problem by dispatching updates to a queue that is read by a single engine per stack.

My approach also has the neat side-effects of pushing the data required to resolve get_resource and get_att (without having to reload the resources again and query them) as well as to update dependencies (e.g. because of a replacement or deletion) along with the flow of triggers. I don't know if anything similar is at work here.

It's entirely possible that the best design might combine elements of both approaches.

The same open questions I detailed under my proposal also apply to this one, if I understand correctly.

I'm certain that I won't have represented everyone's work fairly here, so I encourage folks to dive in and correct any errors about theirs and ask any questions you might have about mine. (In case you have been living under a rock, note that I'll be out of the office for the rest of the week due to Thanksgiving so don't expect immediate replies.)

I also think this would be a great time for the wider Heat community to dive in and start asking questions and suggesting ideas. We need to, ahem, converge on a shared understanding of the design so we can all get to work delivering it for Kilo.


cheers,
Zane.
