On 11/12/14 01:14, Anant Patil wrote:
On 04-Dec-14 10:49, Zane Bitter wrote:
On 01/12/14 02:02, Anant Patil wrote:
On GitHub:https://github.com/anantpatil/heat-convergence-poc

I'm trying to review this code at the moment, and finding some stuff I
don't understand:

https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916

This appears to loop through all of the resources *prior* to kicking off
any actual updates to check if the resource will change. This is
impossible to do in general, since a resource may obtain a property
value from an attribute of another resource and there is no way to know
whether an update to said other resource would cause a change in the
attribute value.

In addition, no attempt to catch UpdateReplace is made. Although that
looks like a simple fix, I'm now worried about the level to which this
code has been tested.

We were working on new branch and as we discussed on Skype, we have
handled all these cases. Please have a look at our current branch:
https://github.com/anantpatil/heat-convergence-poc/tree/graph-version

When a new resource is taken for convergence, its children are loaded
and the resource definition is re-parsed. The frozen resource definition
will have all the "get_attr" resolved.


I'm also trying to wrap my head around how resources are cleaned up in
dependency order. If I understand correctly, you store in the
ResourceGraph table the dependencies between various resource names in
the current template (presumably there could also be some left around
from previous templates too?). For each resource name there may be a
number of rows in the Resource table, each with an incrementing version.
As far as I can tell though, there's nowhere that the dependency graph
for _previous_ templates is persisted? So if the dependency order
changes in the template we have no way of knowing the correct order to
clean up in any more? (There's not even a mechanism to associate a
resource version with a particular template, which might be one avenue
by which to recover the dependencies.)

I think this is an important case we need to be able to handle, so I
added a scenario to my test framework to exercise it and discovered that
my implementation was also buggy. Here's the fix:
https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40


Thanks for pointing this out Zane. We too had a buggy implementation for
handling inverted dependency. I had a hard look at our algorithm where
we were continuously merging the edges from new template into the edges
from previous updates. It was an optimized way of traversing the graph
in both forward and reverse order with out missing any resources. But,
when the dependencies are inverted,  this wouldn't work.

We have changed our algorithm. The changes in edges are noted down in
DB, only the delta of edges from previous template is calculated and
kept. At any given point of time, the graph table has all the edges from
current template and delta from previous templates. Each edge has
template ID associated with it.

The thing is, the cleanup dependencies aren't really about the template. The real resources really depend on other real resources. You can't delete a Volume before its VolumeAttachment, not because it says so in the template but because it will fail if you try. The template can give us a rough guide in advance to what those dependencies will be, but if that's all we keep then we are discarding information.

There may be multiple versions of a resource corresponding to one template version. Even worse, the actual dependencies of a resource change on a smaller time scale than an entire stack update (this is the reason the current implementation updates the template one resource at a time as we go).

Given that our Resource entries in the DB are in 1:1 correspondence with actual resources (we create a new one whenever we need to replace the underlying resource), I found it makes the most conceptual and practical sense to store the requirements in the resource itself, and update them at the time they actually change in the real world (bonus: introduces no new locking issues and no extra DB writes). I settled on this after a legitimate attempt at trying other options, but they didn't work out: https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d

For resource clean up, we start from the
first template (template which was completed and updates were made on
top of it, empty template otherwise), and move towards the current
template in the order in which the updates were issued, and for each
template the graph (edges if found for the template) is traversed in
reverse order and resources are cleaned-up.

I'm pretty sure this is backwards - you'll need to clean up newer resources first because they may reference resources from older templates. Also if you have a stubborn old resource that won't delete you don't want that to block cleanups of anything newer.

You're also serialising some of the work unnecessarily because you've discarded the information about dependencies that cross template versions, forcing you to clean up only one template version at a time.

The process ends up with
current template being traversed in reverse order and resources being
cleaned up. All the update-replaced resources from the older templates
(older updates in concurrent updates) are cleaned up in the order in
which they are suppose to be.

Resources are now tied to template, they have a template_id instead of
version. As we traverse the graph, we know which template we are working
on, and can take the relevant action on resource.

For rollback, another update is issued with the last completed template
(it is designed to have an empty template as last completed template by
default). The current template being worked on becomes predecessor for
the new incoming template. In case of rollback, the last completed
template becomes incoming new template, the current becomes the new
template's predecessor and the successor of last completed template will
have no predecessor. All these changes are available in the
'graph-version' branch. (The branch name is a misnomer though!)

I think it is simpler to think about stack and concurrent updates when
we associate resources and edges with template, and stack with current
template and its predecessors (if any).

It doesn't seem simple to me because it's trying to reconstruct reality from a lossy version of history. The simplest way to think about it, in my opinion is this: - When updating resources, respect their dependencies as given in the template - When checking resources to clean up, respect their actual, current real-world dependencies, and check replacement resources before the resources that they replaced. - Don't check a resource for clean up until it has been updated to the latest template.

I also think that we should decouple Resource from Stack. This is really
a hindrance when workers work on individual resources. The resource
should be abstracted enough from stack for the worker to work on the
resource alone. The worker should load the required resource plug-in and
start converging.

I think that's a worthy goal, and it would be really nice if we could load a Resource completely independently of its Stack, and I know this has always been a design goal of yours (hence you're caching the resource definition in the Resource rather than getting it from the template).

That said, I am convinced it's an unachievable goal, and I believe we should give up on it.

- We'll always need to load _some_ central thing (e.g. to find out if the current traversal is still the valid one), so it might as well be the Stack. - Existing plugin abominations like HARestarter expect a working Stack object to be provided so it can go hunting for other resources.

I think the best we can do is try to make heat.engine.stack.Stack as lazy as possible so that it only does extra work when strictly required, and just accept that the stack will always be loaded from the database.

I am also strongly in favour of treating the idea of caching the unresolved resource definition in the Resource table as a straight performance optimisation that is completely separate to the convergence work. It's going to be inevitably ugly because there is no template-format-independent way to serialise a resource definition (while resource definition objects themselves are designed to be inherently template-format-independent). Once phase 1 is complete we can decide whether it's worth it based on measuring the actual performance improvement.

(Note that we _already_ store the _resolved_ properties of the resource, which is what the observer will be comparing against for phase 2, so there should be no reason for the observer to need to load the stack.)

The READEME.rst is really helpful for bringing up the minimal devstack
and test the PoC. I also has some notes on design.

[snip]

Zane, I have few questions:
1. Our current implementation is based on notifications from worker so
that the engine can take up next set of tasks. I don't see this in your
case. I think we should be doing this. It gels well with observer
notification mechanism. When the observer comes, it would send a
converge notification. Both, the provisioning of stack and the
continuous observation, happens with notifications (async message
passing). I see that the workers in your case pick up the parent when/if
it is done and schedules it or updates the sync point.

I'm not quite sure what you're asking here, so forgive me if I'm misunderstanding. What I think you're saying is that where my prototype propagates notifications thus:

  worker -> worker
         -> worker

(where -> is an async message)
you would prefer it to do:

  worker -> engine -> worker
                   -> worker

Is that right?

To me the distinction seems somewhat academic, given that we've decided that the engine and the worker will be the same process. I don't see a disadvantage to doing right away stuff that we know needs to be done right away. Obviously we should factor the code out tidily into a separate method where we can _also_ expose it as a notification that can be triggered by the continuous observer.

You mentioned above that you thought the workers should not ever load the Stack, and I think that's probably the reason you favour this approach: the 'worker' would always load just the Resource and the 'engine' (even though they're really the same) would always load just the Stack, right?

However, as I mentioned above, I think we'll want/have to load the Stack in the worker anyway, so eliminating the extra asynchronous call eliminates the performance penalty for having to do so.

2. The dependency graph travels everywhere. IMHO, we can keep the graph
in DB and let the workers work on a resource, and engine decide which
one to be scheduled next by looking at the graph. There wouldn't be a
need for a lock here, in the engine, the DB transactions should take
care of concurrent DB updates. Our new PoC follows this model.

I'm fine with keeping the graph in the DB instead of having it flow with the notifications.

3. The request ID is passed down to check_*_complete. Would the check
method be interrupted if new request arrives? IMHO, the check method
should not get interrupted. It should return back when the resource has
reached a concrete state, either failed or completed.

I agree, it should not be interrupted.

I've started to think of phase 1 and phase 2 like this:

1) Make locks more granular: stack-level lock becomes resource-level
2) Get rid of locks altogether

So in phase 1 we'll lock the resources and like you said, it will return back when it has reached a concrete state. In phase 2 we'll be able to just update the goal state for the resource and the observe/converge process will be able to automagically find the best way to that state regardless of whether it was in the middle of transitioning to another state or not. Or something. But that's for the future :)

4. Lot of synchronization issues which we faced in our PoC cannot be
encountered with the framework. How do we evaluate what happens when
synchronization issues are encountered (like stack lock kind of issues
which we are replacing with DB transaction).

Right, yeah, this is obviously the big known limitation of the simulator. I don't have a better answer other than to Think Very Hard about it.

Designing software means solving for hundreds of constraints - too many for a human to hold in their brain at the same time. The purpose of prototyping is to fix enough of the responses to those constraints in a concrete form to allow reasoning about the remaining ones to become tractable. If you fix solutions for *all* of the constraints, then what you've built is by definition not a prototype but the final product.

One technique available to us is to encapsulate the parts of the algorithm that are subject to synchronisation issues behind abstractions that offer stronger guarantees. Then in order to have confidence in the design we need only satisfy ourselves that we have analysed the guarantees correctly and that a concrete implementation offering those same guarantees is possible. For example, the SyncPoints are shown to work under the assumption that they are not subject to race conditions, and the SyncPoint code is small enough that we can easily see that it can be implemented in an atomic fashion using the same DB primitives already proven to work by StackLock. Therefore we can have a very high confidence (but not proof) that the overall algorithm will work when implemented in the final product.

Having Thought Very Hard about it, I'm as confident as I can be that I'm not relying on any synchronisation properties that can't be implemented using select-for-update on a single database row. There will of course be surprises at implementation time, but I hope that won't be one of them and anticipate that any changes required to the plan will be localised and not wide-ranging.

(This is in contrast BTW to my centralised-graph branch, linked above, where it became very obvious that it would require some sort of external locking - so there is reason to think that this process can reveal architectural problems related to synchronisation where they are present.)

cheers,
Zane.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to