On 15/12/14 09:32, Anant Patil wrote:
On 12-Dec-14 06:29, Zane Bitter wrote:
On 11/12/14 01:14, Anant Patil wrote:
On 04-Dec-14 10:49, Zane Bitter wrote:
On 01/12/14 02:02, Anant Patil wrote:
On GitHub: https://github.com/anantpatil/heat-convergence-poc

I'm trying to review this code at the moment, and finding some stuff I
don't understand:

https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916

This appears to loop through all of the resources *prior* to kicking off
any actual updates to check if the resource will change. This is
impossible to do in general, since a resource may obtain a property
value from an attribute of another resource and there is no way to know
whether an update to said other resource would cause a change in the
attribute value.

In addition, no attempt to catch UpdateReplace is made. Although that
looks like a simple fix, I'm now worried about the level to which this
code has been tested.

We were working on a new branch, and as we discussed on Skype, we have
handled all these cases. Please have a look at our current branch:
https://github.com/anantpatil/heat-convergence-poc/tree/graph-version

When a new resource is taken up for convergence, its children are loaded
and the resource definition is re-parsed. The frozen resource definition
will have all the "get_attr" references resolved.
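A minimal sketch of what "freezing" a definition could look like: walk the definition and substitute each "get_attr" with the concrete attribute value of an already-realised resource. The function and data shapes here are illustrative, not the actual PoC classes.

```python
# Hypothetical sketch: resolve "get_attr" references in a resource
# definition against resources that have already been realised.

def freeze_definition(definition, resolved_resources):
    """Return a copy of `definition` with every {"get_attr": [res, attr]}
    node replaced by the concrete attribute value."""
    def resolve(snippet):
        if isinstance(snippet, dict):
            if set(snippet) == {"get_attr"}:
                res_name, attr = snippet["get_attr"]
                return resolved_resources[res_name][attr]
            return {key: resolve(value) for key, value in snippet.items()}
        if isinstance(snippet, list):
            return [resolve(item) for item in snippet]
        return snippet
    return resolve(definition)
```

Once frozen like this, the definition can be persisted with the resource and a worker no longer needs the full template to act on it.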


I'm also trying to wrap my head around how resources are cleaned up in
dependency order. If I understand correctly, you store in the
ResourceGraph table the dependencies between various resource names in
the current template (presumably there could also be some left around
from previous templates too?). For each resource name there may be a
number of rows in the Resource table, each with an incrementing version.
As far as I can tell though, there's nowhere that the dependency graph
for _previous_ templates is persisted? So if the dependency order
changes in the template we have no way of knowing the correct order to
clean up in any more? (There's not even a mechanism to associate a
resource version with a particular template, which might be one avenue
by which to recover the dependencies.)

I think this is an important case we need to be able to handle, so I
added a scenario to my test framework to exercise it and discovered that
my implementation was also buggy. Here's the fix:
https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40


Thanks for pointing this out, Zane. We too had a buggy implementation for
handling inverted dependencies. I had a hard look at our algorithm, in
which we were continuously merging the edges from the new template into
the edges from previous updates. It was an optimized way of traversing
the graph in both forward and reverse order without missing any
resources, but when the dependencies are inverted, it doesn't work.

We have changed our algorithm. The changes in edges are recorded in the
DB: only the delta of edges relative to the previous template is
calculated and kept. At any given point in time, the graph table has all
the edges from the current template plus the delta from previous
templates. Each edge has a template ID associated with it.
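The scheme described here could be sketched as follows: edges found in the new template get re-tagged with its ID, while edges that only existed in older templates remain as the stored delta. The dict-based "graph table" is a stand-in for the DB schema.

```python
# Illustrative sketch (not the PoC schema): the graph maps each edge
# (requirer, required) to the ID of the template it last appeared in.

def merge_template_edges(graph, new_tmpl_id, new_edges):
    """Re-tag edges present in the new template with its ID; edges only
    in older templates are kept untouched, forming the delta that is
    later used for clean-up."""
    for edge in new_edges:
        graph[edge] = new_tmpl_id     # the template ID "moves" forward
    return graph

graph = {("attachment", "volume"): "t1", ("server", "volume"): "t1"}
merge_template_edges(graph, "t2", [("attachment", "volume")])
# ("attachment", "volume") now belongs to t2, while ("server", "volume")
# stays behind as a t1 delta edge until it is cleaned up.
```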

The thing is, the cleanup dependencies aren't really about the template.
The real resources really depend on other real resources. You can't
delete a Volume before its VolumeAttachment, not because it says so in
the template but because it will fail if you try. The template can give
us a rough guide in advance to what those dependencies will be, but if
that's all we keep then we are discarding information.

There may be multiple versions of a resource corresponding to one
template version. Even worse, the actual dependencies of a resource
change on a smaller time scale than an entire stack update (this is the
reason the current implementation updates the template one resource at a
time as we go).


Absolutely! The edges from the template are kept only for reference
purposes. When a resource appears in the new template, its template ID
is also updated to the current template. At any point in time, a
realized resource will belong to the current template, even if it was
first found in a previous template. The template ID "moves" forward for
a resource whenever it is found.

In theory (disclaimer: I didn't implement this yet) it can change on an even smaller timescale than that. The existing plugins are something of a black box to us: if a failure occurs we don't necessarily know whether the real-world dependency is on the old or new version of another resource.

Given that our Resource entries in the DB are in 1:1 correspondence with
actual resources (we create a new one whenever we need to replace the
underlying resource), I found it makes the most conceptual and practical
sense to store the requirements in the resource itself, and update them
at the time they actually change in the real world (bonus: introduces no
new locking issues and no extra DB writes). I settled on this after a
legitimate attempt at trying other options, but they didn't work out:
https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d
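The idea of storing requirements on the resource itself, written at the moment the real-world dependency comes into being, could look roughly like this. The class and attribute names are illustrative, not Heat's actual model.

```python
# Hedged sketch: each Resource row carries its own requirements list,
# recorded when the real dependency is established. Writing it as part
# of the resource's own update introduces no new locking and no extra
# DB writes.

class Resource:
    def __init__(self, rid):
        self.id = rid
        self.requirements = []   # IDs of resources this one depends on

    def create(self, required_resources):
        # Record the real-world dependency at the time it is created;
        # this is just another column on the row being written anyway.
        self.requirements = [r.id for r in required_resources]
        # ... then create the physical resource ...

volume = Resource("vol-1")
volume.create([])
attachment = Resource("attach-1")
attachment.create([volume])
```

Clean-up can then read `requirements` straight off each Resource row, with no need to reconstruct dependencies from old templates.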


I am okay with the notion of graph stored in resource table.

For resource clean-up, we start from the first template (the template
which was completed and on top of which the updates were made, or the
empty template otherwise) and move towards the current template in the
order in which the updates were issued. For each template, the graph
(the edges found for that template) is traversed in reverse order and
the resources are cleaned up.

I'm pretty sure this is backwards - you'll need to clean up newer
resources first because they may reference resources from older
templates. Also if you have a stubborn old resource that won't delete
you don't want that to block cleanups of anything newer.

You're also serialising some of the work unnecessarily because you've
discarded the information about dependencies that cross template
versions, forcing you to clean up only one template version at a time.


Since only the changed set is stored, it is going to be smaller in most
cases. Within each previous template, if there are any edges, they are
tried for clean-up in reverse order. If a resource is not updated but is
present in the new template, its template ID will be the current
template ID, as I mentioned above. Within a template the process is
concurrent, but across templates it will be serial. I am okay with this
since, in the real world, the change in dependencies may not be as
large as we think.

I agree the serialisation probably won't make a big difference in most cases, although autoscaling is an obvious and important one where serialising a lot of things unnecessarily could have a big impact on the user.

The process ends with the current template being traversed in reverse
order and resources being cleaned up. All the update-replaced resources
from the older templates (older updates, in the case of concurrent
updates) are cleaned up in the order in which they are supposed to be.

Resources are now tied to a template: they have a template_id instead of
a version. As we traverse the graph, we know which template we are
working on and can take the relevant action on the resource.

For rollback, another update is issued with the last completed template
(by design, the last completed template defaults to an empty template).
The current template being worked on becomes the predecessor for the new
incoming template. In the case of rollback, the last completed template
becomes the incoming new template, the current one becomes the new
template's predecessor, and the successor of the last completed template
will have no predecessor. All these changes are available in the
'graph-version' branch. (The branch name is a misnomer though!)

I think it is simpler to reason about the stack and concurrent updates
when we associate resources and edges with a template, and the stack
with the current template and its predecessors (if any).

It doesn't seem simple to me because it's trying to reconstruct reality
from a lossy version of history. The simplest way to think about it, in
my opinion, is this:
- When updating resources, respect their dependencies as given in the
template
- When checking resources to clean up, respect their actual, current
real-world dependencies, and check replacement resources before the
resources that they replaced.
- Don't check a resource for clean up until it has been updated to the
latest template.
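The second rule (real-world dependencies, replacements before the resources they replaced) can be sketched as a small ordering routine. The data model here (a `requirements` list and a `replaces` pointer per resource) is purely illustrative, not Heat's schema.

```python
# Hypothetical sketch: yield resources in a safe clean-up order, i.e.
# requirers before their requirements and replacement resources before
# the resources they replaced. Assumes an acyclic graph.

def cleanup_order(resources):
    """resources: dict id -> {"requirements": [ids], "replaces": id|None}."""
    # after[r] = the set of resources that must be checked before r.
    after = {rid: set() for rid in resources}
    for rid, info in resources.items():
        for req in info["requirements"]:
            after[req].add(rid)                # requirer goes first
        if info["replaces"] is not None:
            after[info["replaces"]].add(rid)   # replacement goes first
    done = set()
    while len(done) < len(resources):
        for rid in resources:
            if rid not in done and after[rid] <= done:
                done.add(rid)
                yield rid

resources = {
    "vol":     {"requirements": [],      "replaces": None},
    "attach":  {"requirements": ["vol"], "replaces": None},
    "attach2": {"requirements": ["vol"], "replaces": "attach"},
}
order = list(cleanup_order(resources))
```

Note the VolumeAttachment replacement is checked before the attachment it replaced, and both before the Volume they attach, regardless of which template version each came from.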

I also think that we should decouple Resource from Stack. This is really
a hindrance when workers work on individual resources. The resource
should be abstracted enough from stack for the worker to work on the
resource alone. The worker should load the required resource plug-in and
start converging.

I think that's a worthy goal, and it would be really nice if we could
load a Resource completely independently of its Stack, and I know this
has always been a design goal of yours (hence you're caching the
resource definition in the Resource rather than getting it from the
template).

That said, I am convinced it's an unachievable goal, and I believe we
should give up on it.

- We'll always need to load _some_ central thing (e.g. to find out if
the current traversal is still the valid one), so it might as well be
the Stack.
- Existing plugin abominations like HARestarter expect a working Stack
object to be provided so it can go hunting for other resources.

I think the best we can do is try to make heat.engine.stack.Stack as
lazy as possible so that it only does extra work when strictly required,
and just accept that the stack will always be loaded from the database.

I am also strongly in favour of treating the idea of caching the
unresolved resource definition in the Resource table as a straight
performance optimisation that is completely separate to the convergence
work. It's going to be inevitably ugly because there is no
template-format-independent way to serialise a resource definition
(while resource definition objects themselves are designed to be
inherently template-format-independent). Once phase 1 is complete we can
decide whether it's worth it based on measuring the actual performance
improvement.


The idea of keeping the resource definition in DB was to decouple from
the stack as much as possible. It is not for performance optimization.

OK, that's fair, but then you said...

When a worker job is received, the worker won't have to load the entire
template from the stack, get the realized definitions etc.

...that's a performance optimisation.

The worker simply loads the resource, with its definition, and a minimal
(lazily loaded) stack, and starts working on it. This work is towards
making the stack lazily loaded.

Like you said, the stack has to be minimal when it is loaded. There is
only a little information needed by the worker (like the resource
definition and requirers/requirements), which can be obtained without
loading the template and recalculating the graph again and again.

None of this changes the fundamental design, so let's get the important stuff done first and when that's finished we can see if this makes a meaningful (positive) difference.

(Note that we _already_ store the _resolved_ properties of the resource,
which is what the observer will be comparing against for phase 2, so
there should be no reason for the observer to need to load the stack.)

The README.rst is really helpful for bringing up the minimal devstack
and testing the PoC. It also has some notes on the design.

[snip]

Zane, I have a few questions:
1. Our current implementation is based on notifications from the worker
so that the engine can take up the next set of tasks. I don't see this
in your case, and I think we should be doing it. It gels well with the
observer notification mechanism: when the observer comes in, it will
send a converge notification. Both the provisioning of the stack and the
continuous observation happen via notifications (async message passing).
I see that in your case the workers pick up the parent when/if it is
done and schedule it or update the sync point.

I'm not quite sure what you're asking here, so forgive me if I'm
misunderstanding. What I think you're saying is that where my prototype
propagates notifications thus:

    worker -> worker
           -> worker

(where -> is an async message)
you would prefer it to do:

    worker -> engine -> worker
                     -> worker

Is that right?

To me the distinction seems somewhat academic, given that we've decided
that the engine and the worker will be the same process. I don't see a
disadvantage to doing right away stuff that we know needs to be done
right away. Obviously we should factor the code out tidily into a
separate method where we can _also_ expose it as a notification that can
be triggered by the continuous observer.

You mentioned above that you thought the workers should not ever load
the Stack, and I think that's probably the reason you favour this
approach: the 'worker' would always load just the Resource and the
'engine' (even though they're really the same) would always load just
the Stack, right?

However, as I mentioned above, I think we'll want/have to load the Stack
in the worker anyway, so eliminating the extra asynchronous call
eliminates the performance penalty for having to do so.


When we started, we had the following idea:


             (provisions with
              converge API)             (Observe)
     Engine ------------------> Worker --------------> Observer
       |                         | |                        |
       |                         | |                        |
       ^     (done)              v ^         (converge)     v
        ------------------------    ------------------------

The boundary has to be clearly demarcated, as done above. While
provisioning a stack, the Engine uses the converge facility, as does the
observer when it sees a need to converge a resource.

Even though these logical entities live in the same process, if we do
not clearly demarcate the responsibilities of each, we will end up with
the issues we always face when responsibilities are mixed.

Sure, and that's why we don't write all the code in one big method, but it doesn't follow that it has to be in a different process. Or in this case, since they're actually the same process anyway, it doesn't imply that they have to be called over RPC.

As you mentioned, the worker should have a notification API exposed
which can be used not only by the Observer, but also by the engine to
provision (CRUD) a stack resource.

yep

The engine decides which resource is to be converged next when a
resource is done.

Who says it has to? Each resource knows what depends on it... let it send the notifications directly. The engine is just there to do whatever setup requires everything to be loaded at once and then kick things off by starting the worker tasks for resources that don't depend on anything.
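This "each resource notifies its dependents directly" idea can be sketched with an in-process queue standing in for the async messages; all names here are illustrative.

```python
# Hedged sketch: the engine only kicks off resources with no
# requirements; each finished worker then notifies its requirers
# itself, with no central scheduler in the loop.

from collections import deque

def converge(requirers, process):
    """requirers: dict resource -> set of resources that depend on it."""
    # Count unfinished requirements for every resource.
    pending = {r: 0 for r in requirers}
    for deps in requirers.values():
        for dep in deps:
            pending[dep] += 1
    # The engine's only job: start the leaves...
    queue = deque(r for r, n in pending.items() if n == 0)
    order = []
    while queue:
        res = queue.popleft()
        process(res)          # stand-in for the worker's real work
        order.append(res)
        # ...then each worker "sends a message" to its requirers.
        for dep in requirers[res]:
            pending[dep] -= 1
            if pending[dep] == 0:
                queue.append(dep)
    return order

graph = {"net": {"server"}, "vol": {"attach"},
         "server": {"attach"}, "attach": set()}
order = converge(graph, lambda r: None)
```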

The observer is only responsible for one resource and issues a converge
request if it has to.

2. The dependency graph travels everywhere. IMHO, we can keep the graph
in the DB, let the workers work on a resource, and let the engine decide
which one is to be scheduled next by looking at the graph. There
wouldn't be a need for a lock in the engine; the DB transactions should
take care of concurrent DB updates. Our new PoC follows this model.

I'm fine with keeping the graph in the DB instead of having it flow with
the notifications.

3. The request ID is passed down to check_*_complete. Would the check
method be interrupted if a new request arrives? IMHO, the check method
should not be interrupted. It should return when the resource has
reached a concrete state, either failed or completed.

I agree, it should not be interrupted.

I've started to think of phase 1 and phase 2 like this:

1) Make locks more granular: stack-level lock becomes resource-level
2) Get rid of locks altogether


I thought that the DB transactions were enough for the concurrency
issues. The resource's status is enough to know whether to trigger the
update or not. For resources currently in progress, the engine can wait
for notifications from them and then trigger the update. With a
completely distributed graph and decision making this is hard to
achieve. If the graph is stored in the DB, the decision to trigger an
update is made in one place (by looking at the graph only), and hence
locks are not required.

Right, right, but it's still a lock (maybe I should have said mutex), because only one traversal can be operating on each resource at a time; we're just arguing semantics now. It doesn't matter whether you do select-for-update on the Resource table to mark it as in progress or select-for-update on a separate Lock table: either way, only one update can be working on a given resource at a time. That is a big improvement in granularity, since before only one update could work on the whole stack at a time, but not as big an improvement as phase 2 will be.
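Whichever table the "lock" lives in, the effect is a compare-and-swap inside one DB transaction. A rough illustration, using SQLite and an atomic conditional UPDATE as a stand-in for select-for-update (the table layout is made up for the example):

```python
# Hypothetical sketch: only one traversal can flip a resource row to
# IN_PROGRESS, because the conditional UPDATE is atomic.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO resource VALUES ('server1', 'COMPLETE')")
conn.commit()

def try_acquire(conn, rid):
    # Atomic read-modify-write: succeeds only if nobody else already
    # marked the row IN_PROGRESS. This is the resource-level mutex.
    cur = conn.execute(
        "UPDATE resource SET status='IN_PROGRESS' "
        "WHERE id=? AND status != 'IN_PROGRESS'", (rid,))
    conn.commit()
    return cur.rowcount == 1

first = try_acquire(conn, "server1")    # this traversal wins
second = try_acquire(conn, "server1")   # a concurrent one does not
```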

So in phase 1 we'll lock the resources and, like you said, it will
return when it has reached a concrete state. In phase 2 we'll be able to
just update the goal state for the resource, and the observe/converge
process will be able to automagically find the best way to that state
regardless of whether it was in the middle of transitioning to another
state or not. Or something. But that's for the future :)


I am thinking about not having any locks at all. The concurrency is
handled by DB transactions, and notifications give us hints for deciding
when and whether to update the next set of resources. If a resource is
currently in progress and an update is issued on that resource, the
graph API doesn't return it as a ready-to-be-converged resource, since a
previous version is in progress. When it is done and the notification is
received, the graph DB API returns it as next to converge.
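The behaviour described here could look roughly like this: the "next to converge" query simply never returns a resource whose previous version is still in progress, and the completion notification makes it eligible later. The function and status names are illustrative.

```python
# Hedged sketch of a lock-free "next to converge" query: an in-progress
# resource is just not returned, so no traversal ever needs to block.

def next_to_converge(requirements, status):
    """requirements: resource -> set of required resources.
    status: resource -> 'PENDING' | 'IN_PROGRESS' | 'COMPLETE'."""
    return sorted(
        res for res, reqs in requirements.items()
        if status[res] == "PENDING"                     # not busy, not done
        and all(status[req] == "COMPLETE" for req in reqs))

reqs = {"server": set(), "attach": {"server"}}
busy = next_to_converge(reqs, {"server": "IN_PROGRESS", "attach": "PENDING"})
ready = next_to_converge(reqs, {"server": "COMPLETE", "attach": "PENDING"})
```

When the notification for `server` arrives, its status moves on and `attach` shows up in the next query.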

So sort of a hybrid with Michal's approach.

I think I get what you're saying here. If we reach a particular resource in the traversal and it is still in progress from a previous update, we can ignore it and use the notification from the previous update finishing to trigger it later, provided that the notifications go through the engine, because the in-progress resource can't possibly know anything about traversals that started later.

Or on the other hand we could just sleep for a while until it's done and then carry on.

4. A lot of the synchronization issues which we faced in our PoC cannot
be encountered within the simulation framework. How do we evaluate what
happens when synchronization issues are encountered (like the stack-lock
kind of issues, which we are replacing with DB transactions)?

Right, yeah, this is obviously the big known limitation of the
simulator. I don't have a better answer other than to Think Very Hard
about it.

Designing software means solving for hundreds of constraints - too many
for a human to hold in their brain at the same time. The purpose of
prototyping is to fix enough of the responses to those constraints in a
concrete form to allow reasoning about the remaining ones to become
tractable. If you fix solutions for *all* of the constraints, then what
you've built is by definition not a prototype but the final product.

One technique available to us is to encapsulate the parts of the
algorithm that are subject to synchronisation issues behind abstractions
that offer stronger guarantees. Then in order to have confidence in the
design we need only satisfy ourselves that we have analysed the
guarantees correctly and that a concrete implementation offering those
same guarantees is possible. For example, the SyncPoints are shown to
work under the assumption that they are not subject to race conditions,
and the SyncPoint code is small enough that we can easily see that it
can be implemented in an atomic fashion using the same DB primitives
already proven to work by StackLock. Therefore we can have a very high
confidence (but not proof) that the overall algorithm will work when
implemented in the final product.
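The SyncPoint guarantee described above (fire exactly once, after all predecessors have signalled) is small enough to sketch directly. Here a `threading.Lock` stands in for the single select-for-update'd DB row; the class shape is illustrative, not the prototype's code.

```python
# Hedged sketch of a SyncPoint: a counter that triggers its callback
# exactly once, after `expected` signals, with the read-modify-write
# made atomic (in the real thing, by select-for-update on one row).

import threading

class SyncPoint:
    def __init__(self, expected, callback):
        self._expected = expected
        self._count = 0
        self._callback = callback
        self._lock = threading.Lock()   # stand-in for the DB transaction

    def signal(self):
        with self._lock:                # atomic increment-and-check
            self._count += 1
            fire = (self._count == self._expected)
        if fire:
            self._callback()            # runs exactly once

fired = []
sp = SyncPoint(2, lambda: fired.append(True))
sp.signal()     # first predecessor: nothing happens yet
sp.signal()     # second predecessor: callback fires
```

Because the whole critical section is one atomic counter update, the same guarantee is implementable with the DB primitives StackLock already relies on.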

Having Thought Very Hard about it, I'm as confident as I can be that I'm
not relying on any synchronisation properties that can't be implemented
using select-for-update on a single database row. There will of course
be surprises at implementation time, but I hope that won't be one of
them and anticipate that any changes required to the plan will be
localised and not wide-ranging.

(This is in contrast BTW to my centralised-graph branch, linked above,
where it became very obvious that it would require some sort of external
locking - so there is reason to think that this process can reveal
architectural problems related to synchronisation where they are present.)

cheers,
Zane.


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev