Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

Zane Bitter Thu, 11 Dec 2014 17:01:48 -0800

On 11/12/14 01:14, Anant Patil wrote:

On 04-Dec-14 10:49, Zane Bitter wrote:

On 01/12/14 02:02, Anant Patil wrote:

On GitHub: https://github.com/anantpatil/heat-convergence-poc


I'm trying to review this code at the moment, and finding some stuff I
don't understand:

https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916

This appears to loop through all of the resources *prior* to kicking off
any actual updates to check if the resource will change. This is
impossible to do in general, since a resource may obtain a property
value from an attribute of another resource and there is no way to know
whether an update to said other resource would cause a change in the
attribute value.

In addition, no attempt to catch UpdateReplace is made. Although that
looks like a simple fix, I'm now worried about the level to which this
code has been tested.

We were working on a new branch and, as we discussed on Skype, we have
handled all these cases. Please have a look at our current branch:
https://github.com/anantpatil/heat-convergence-poc/tree/graph-version

When a new resource is taken for convergence, its children are loaded
and the resource definition is re-parsed. The frozen resource definition
will have all the "get_attr" resolved.


I'm also trying to wrap my head around how resources are cleaned up in
dependency order. If I understand correctly, you store in the
ResourceGraph table the dependencies between various resource names in
the current template (presumably there could also be some left around
from previous templates too?). For each resource name there may be a
number of rows in the Resource table, each with an incrementing version.
As far as I can tell though, there's nowhere that the dependency graph
for _previous_ templates is persisted? So if the dependency order
changes in the template we have no way of knowing the correct order to
clean up in any more? (There's not even a mechanism to associate a
resource version with a particular template, which might be one avenue
by which to recover the dependencies.)

I think this is an important case we need to be able to handle, so I
added a scenario to my test framework to exercise it and discovered that
my implementation was also buggy. Here's the fix:
https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40


Thanks for pointing this out Zane. We too had a buggy implementation for
handling inverted dependency. I had a hard look at our algorithm where
we were continuously merging the edges from new template into the edges
from previous updates. It was an optimized way of traversing the graph
in both forward and reverse order without missing any resources. But
when the dependencies are inverted, this wouldn't work.

We have changed our algorithm. The changes in edges are noted down in
DB, only the delta of edges from previous template is calculated and
kept. At any given point of time, the graph table has all the edges from
current template and delta from previous templates. Each edge has
template ID associated with it.

The thing is, the cleanup dependencies aren't really about the template.
The real resources really depend on other real resources. You can't
delete a Volume before its VolumeAttachment, not because it says so in
the template but because it will fail if you try. The template can give
us a rough guide in advance to what those dependencies will be, but if
that's all we keep then we are discarding information.

There may be multiple versions of a resource corresponding to one
template version. Even worse, the actual dependencies of a resource
change on a smaller time scale than an entire stack update (this is the
reason the current implementation updates the template one resource at a
time as we go).

Given that our Resource entries in the DB are in 1:1 correspondence with
actual resources (we create a new one whenever we need to replace the
underlying resource), I found it makes the most conceptual and practical
sense to store the requirements in the resource itself, and update them
at the time they actually change in the real world (bonus: introduces no
new locking issues and no extra DB writes). I settled on this after a
legitimate attempt at trying other options, but they didn't work out:
https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d

For resource clean up, we start from the
first template (template which was completed and updates were made on
top of it, empty template otherwise), and move towards the current
template in the order in which the updates were issued, and for each
template the graph (edges if found for the template) is traversed in
reverse order and resources are cleaned-up.

I'm pretty sure this is backwards - you'll need to clean up newer
resources first because they may reference resources from older
templates. Also if you have a stubborn old resource that won't delete
you don't want that to block cleanups of anything newer.

You're also serialising some of the work unnecessarily because you've
discarded the information about dependencies that cross template
versions, forcing you to clean up only one template version at a time.

The process ends up with
current template being traversed in reverse order and resources being
cleaned up. All the update-replaced resources from the older templates
(older updates in concurrent updates) are cleaned up in the order in
which they are supposed to be.

Resources are now tied to template, they have a template_id instead of
version. As we traverse the graph, we know which template we are working
on, and can take the relevant action on resource.

For rollback, another update is issued with the last completed template
(it is designed to have an empty template as last completed template by
default). The current template being worked on becomes predecessor for
the new incoming template. In case of rollback, the last completed
template becomes incoming new template, the current becomes the new
template's predecessor and the successor of last completed template will
have no predecessor. All these changes are available in the
'graph-version' branch. (The branch name is a misnomer though!)

I think it is simpler to think about stack and concurrent updates when
we associate resources and edges with template, and stack with current
template and its predecessors (if any).

It doesn't seem simple to me because it's trying to reconstruct reality
from a lossy version of history. The simplest way to think about it, in
my opinion, is this:

- When updating resources, respect their dependencies as given in the
  template.
- When checking resources to clean up, respect their actual, current
  real-world dependencies, and check replacement resources before the
  resources that they replaced.
- Don't check a resource for clean up until it has been updated to the
  latest template.
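A rough sketch of the first two clean-up rules, only to make the ordering concrete. It is not code from either prototype: the `Res` class, its `requires` and `replaces` fields, and `cleanup_check_order` are hypothetical names, and the third rule (waiting until a resource has been updated to the latest template) is deliberately left out.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional, Set


@dataclass
class Res:
    """Hypothetical stand-in for one row of the Resource table."""
    id: int
    name: str
    # IDs of the real resources this one currently depends on.
    requires: Set[int] = field(default_factory=set)
    # ID of the resource that this one replaced, if any.
    replaces: Optional[int] = None


def cleanup_check_order(resources):
    """Return resource IDs in an order that is safe for clean-up checks."""
    # Map each resource to the set of resources that must be checked first.
    check_first = {r.id: set() for r in resources}
    for r in resources:
        for required_id in r.requires:
            if required_id in check_first:
                # A dependent is checked before the thing it requires.
                check_first[required_id].add(r.id)
        if r.replaces in check_first:
            # A replacement is checked before the resource it replaced.
            check_first[r.replaces].add(r.id)
    return list(TopologicalSorter(check_first).static_order())


old_volume = Res(1, "volume")
attachment = Res(2, "attachment", requires={1})
new_volume = Res(3, "volume", replaces=1)
order = cleanup_check_order([old_volume, attachment, new_volume])
```

Here the old Volume is always checked last, because both its VolumeAttachment and its replacement have to be looked at first.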

I also think that we should decouple Resource from Stack. This is really
a hindrance when workers work on individual resources. The resource
should be abstracted enough from stack for the worker to work on the
resource alone. The worker should load the required resource plug-in and
start converging.

I think that's a worthy goal, and it would be really nice if we could
load a Resource completely independently of its Stack, and I know this
has always been a design goal of yours (hence you're caching the
resource definition in the Resource rather than getting it from the
template).

That said, I am convinced it's an unachievable goal, and I believe we
should give up on it.

- We'll always need to load _some_ central thing (e.g. to find out if
  the current traversal is still the valid one), so it might as well be
  the Stack.
- Existing plugin abominations like HARestarter expect a working Stack
  object to be provided so it can go hunting for other resources.

I think the best we can do is try to make heat.engine.stack.Stack as
lazy as possible so that it only does extra work when strictly required,
and just accept that the stack will always be loaded from the database.

I am also strongly in favour of treating the idea of caching the
unresolved resource definition in the Resource table as a straight
performance optimisation that is completely separate to the convergence
work. It's going to be inevitably ugly because there is no
template-format-independent way to serialise a resource definition
(while resource definition objects themselves are designed to be
inherently template-format-independent). Once phase 1 is complete we can
decide whether it's worth it based on measuring the actual performance
improvement.

(Note that we _already_ store the _resolved_ properties of the resource,
which is what the observer will be comparing against for phase 2, so
there should be no reason for the observer to need to load the stack.)

The README.rst is really helpful for bringing up the minimal devstack
and testing the PoC. It also has some notes on design.

[snip]


Zane, I have a few questions:
1. Our current implementation is based on notifications from worker so
that the engine can take up next set of tasks. I don't see this in your
case. I think we should be doing this. It gels well with observer
notification mechanism. When the observer comes, it would send a
converge notification. Both the provisioning of the stack and the
continuous observation happen through notifications (async message
passing). I see that the workers in your case pick up the parent when/if
it is done and schedule it or update the sync point.

I'm not quite sure what you're asking here, so forgive me if I'm
misunderstanding. What I think you're saying is that where my prototype
propagates notifications thus:


  worker -> worker
         -> worker

(where -> is an async message)
you would prefer it to do:

  worker -> engine -> worker
                   -> worker

Is that right?

To me the distinction seems somewhat academic, given that we've decided
that the engine and the worker will be the same process. I don't see a
disadvantage to doing right away stuff that we know needs to be done
right away. Obviously we should factor the code out tidily into a
separate method where we can _also_ expose it as a notification that can
be triggered by the continuous observer.

You mentioned above that you thought the workers should not ever load
the Stack, and I think that's probably the reason you favour this
approach: the 'worker' would always load just the Resource and the
'engine' (even though they're really the same) would always load just
the Stack, right?

However, as I mentioned above, I think we'll want/have to load the Stack
in the worker anyway, so eliminating the extra asynchronous call
eliminates the performance penalty for having to do so.

2. The dependency graph travels everywhere. IMHO, we can keep the graph
in DB and let the workers work on a resource, and engine decide which
one to be scheduled next by looking at the graph. There wouldn't be a
need for a lock here, in the engine, the DB transactions should take
care of concurrent DB updates. Our new PoC follows this model.

I'm fine with keeping the graph in the DB instead of having it flow with
the notifications.
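To illustrate what "deciding which one to schedule next by looking at the graph" could amount to, here is a minimal sketch. It assumes the graph is stored as (resource, required resource) edge rows; the function name and the in-memory list standing in for the DB table are made up for the example and come from neither PoC.

```python
def ready_resources(edges, all_resources, complete):
    """Return the resources that can be scheduled now.

    edges: (resource, required_resource) pairs, standing in for DB rows.
    all_resources: every resource name in the current traversal.
    complete: names of resources that have already converged.
    """
    complete = set(complete)
    requires = {name: set() for name in all_resources}
    for resource, required in edges:
        requires[resource].add(required)
    # A resource is ready once everything it requires is complete.
    return sorted(name for name, reqs in requires.items()
                  if name not in complete and reqs <= complete)


edges = [("attachment", "volume"), ("attachment", "server")]
names = ["volume", "server", "attachment"]
first = ready_resources(edges, names, set())
last = ready_resources(edges, names, {"volume", "server"})
```

With nothing complete, only the volume and the server are ready; the attachment becomes ready once both of them are done.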

3. The request ID is passed down to check_*_complete. Would the check
method be interrupted if new request arrives? IMHO, the check method
should not get interrupted. It should return back when the resource has
reached a concrete state, either failed or completed.


I agree, it should not be interrupted.

I've started to think of phase 1 and phase 2 like this:

1) Make locks more granular: stack-level lock becomes resource-level
2) Get rid of locks altogether

So in phase 1 we'll lock the resources and like you said, it will return
back when it has reached a concrete state. In phase 2 we'll be able to
just update the goal state for the resource and the observe/converge
process will be able to automagically find the best way to that state
regardless of whether it was in the middle of transitioning to another
state or not. Or something. But that's for the future :)

4. A lot of the synchronization issues which we faced in our PoC cannot
be encountered with the framework. How do we evaluate what happens when
synchronization issues are encountered (like the stack-lock kind of
issues, which we are replacing with DB transactions)?

Right, yeah, this is obviously the big known limitation of the
simulator. I don't have a better answer other than to Think Very Hard
about it.

Designing software means solving for hundreds of constraints - too many
for a human to hold in their brain at the same time. The purpose of
prototyping is to fix enough of the responses to those constraints in a
concrete form to allow reasoning about the remaining ones to become
tractable. If you fix solutions for *all* of the constraints, then what
you've built is by definition not a prototype but the final product.

One technique available to us is to encapsulate the parts of the
algorithm that are subject to synchronisation issues behind abstractions
that offer stronger guarantees. Then in order to have confidence in the
design we need only satisfy ourselves that we have analysed the
guarantees correctly and that a concrete implementation offering those
same guarantees is possible. For example, the SyncPoints are shown to
work under the assumption that they are not subject to race conditions,
and the SyncPoint code is small enough that we can easily see that it
can be implemented in an atomic fashion using the same DB primitives
already proven to work by StackLock. Therefore we can have a very high
confidence (but not proof) that the overall algorithm will work when
implemented in the final product.

Having Thought Very Hard about it, I'm as confident as I can be that I'm
not relying on any synchronisation properties that can't be implemented
using select-for-update on a single database row. There will of course
be surprises at implementation time, but I hope that won't be one of
them and anticipate that any changes required to the plan will be
localised and not wide-ranging.

(This is in contrast BTW to my centralised-graph branch, linked above,
where it became very obvious that it would require some sort of external
locking - so there is reason to think that this process can reveal
architectural problems related to synchronisation where they are
present.)
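As an illustration of the kind of single-row guarantee meant here, the following is a minimal sketch of a sync point kept in one database row. It is not the SyncPoint code from the prototype: the table layout and function names are invented, and because SQLite has no select-for-update it substitutes a compare-and-swap on a version column (an UPDATE that only succeeds if the row is unchanged since it was read).

```python
import json
import sqlite3


def init_db(conn):
    # One row per sync point; 'satisfied' is a JSON list of predecessors.
    conn.execute("CREATE TABLE sync_point ("
                 "key TEXT PRIMARY KEY, satisfied TEXT, version INTEGER)")
    conn.commit()


def create_sync_point(conn, key):
    conn.execute("INSERT INTO sync_point (key, satisfied, version) "
                 "VALUES (?, ?, 0)", (key, json.dumps([])))
    conn.commit()


def sync(conn, key, predecessor, required):
    """Record that 'predecessor' is done.

    Returns True only for the call that completes the required set, so
    the waiting resource is triggered exactly once.
    """
    required = set(required)
    while True:
        satisfied, version = conn.execute(
            "SELECT satisfied, version FROM sync_point WHERE key = ?",
            (key,)).fetchone()
        before = set(json.loads(satisfied))
        after = before | {predecessor}
        # Compare-and-swap: only write if nobody changed the row meanwhile.
        cur = conn.execute(
            "UPDATE sync_point SET satisfied = ?, version = version + 1 "
            "WHERE key = ? AND version = ?",
            (json.dumps(sorted(after)), key, version))
        conn.commit()
        if cur.rowcount == 1:
            return (required <= after) and not (required <= before)
        # Another writer got in first; re-read the row and retry.


conn = sqlite3.connect(":memory:")
init_db(conn)
create_sync_point(conn, "attachment")
results = [sync(conn, "attachment", p, {"volume", "server"})
           for p in ("volume", "server", "server")]
```

Only the second call, the one that completes the set, reports True; the duplicate notification afterwards does not trigger the resource again.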


cheers,
Zane.

