Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

Zane Bitter Mon, 15 Dec 2014 20:19:34 -0800

On 15/12/14 10:15, Anant Patil wrote:

On 13-Dec-14 05:42, Zane Bitter wrote:

On 12/12/14 05:29, Murugan, Visnusaran wrote:

-----Original Message-----
From: Zane Bitter [mailto:zbit...@redhat.com]
Sent: Friday, December 12, 2014 6:37 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
showdown

On 11/12/14 08:26, Murugan, Visnusaran wrote:

[Murugan, Visnusaran]
In case of rollback where we have to cleanup earlier version of
resources,

we could get the order from old template. We'd prefer not to have a
graph table.

In theory you could get it by keeping old templates around. But that
means keeping a lot of templates, and it will be hard to keep track
of when you want to delete them. It also means that when starting an
update you'll need to load every existing previous version of the
template in order to calculate the dependencies. It also leaves the
dependencies in an ambiguous state when a resource fails, and
although that can be worked around it will be a giant pain to implement.


Agree that looking to all templates for a delete is not good. But
baring Complexity, we feel we could achieve it by way of having an
update and a delete stream for a stack update operation. I will
elaborate in detail in the etherpad sometime tomorrow :)

I agree that I'd prefer not to have a graph table. After trying a
couple of different things I decided to store the dependencies in the
Resource table, where we can read or write them virtually for free
because it turns out that we are always reading or updating the
Resource itself at exactly the same time anyway.


Not sure how this will work in an update scenario when a resource does
not change and its dependencies do.


We'll always update the requirements, even when the properties don't
change.


Can you elaborate a bit on rollback.


I didn't do anything special to handle rollback. It's possible that we
need to - obviously the difference in the UpdateReplace + rollback case
is that the replaced resource is now the one we want to keep, and yet
the replaced_by/replaces dependency will force the newer (replacement)
resource to be checked for deletion first, which is an inversion of the
usual order.


This is where the version is so handy! For UpdateReplaced ones, there is


*sometimes*

an older version to go back to. This version could just be template ID,
as I mentioned in another e-mail. All resources are at the current
template ID if they are found in the current template, even if they is
no need to update them. Otherwise, they need to be cleaned-up in the
order given in the previous templates.

I think the template ID is used as version as far as I can see in Zane's
PoC. If the resource template key doesn't match the current template
key, the resource is deleted. The version is misnomer here, but that
field (template id) is used as though we had versions of resources.

Correct. Because if we had wanted to keep it, we would have updated itto the new template version already.

However, I tried to think of a scenario where that would cause problems
and I couldn't come up with one. Provided we know the actual, real-world
dependencies of each resource I don't think the ordering of those two
checks matters.

In fact, I currently can't think of a case where the dependency order
between replacement and replaced resources matters at all. It matters in
the current Heat implementation because resources are artificially
segmented into the current and backup stacks, but with a holistic view
of dependencies that may well not be required. I tried taking that line
out of the simulator code and all the tests still passed. If anybody can
think of a scenario in which it would make a difference, I would be very
interested to hear it.

In any event though, it should be no problem to reverse the direction of
that one edge in these particular circumstances if it does turn out to
be a problem.

We had an approach with depends_on
and needed_by columns in ResourceTable. But dropped it when we figured out
we had too many DB operations for Update.


Yeah, I initially ran into this problem too - you have a bunch of nodes
that are waiting on the current node, and now you have to go look them
all up in the database to see what else they're waiting on in order to
tell if they're ready to be triggered.

It turns out the answer is to distribute the writes but centralise the
reads. So at the start of the update, we read all of the Resources,
obtain their dependencies and build one central graph[1]. We than make
that graph available to each resource (either by passing it as a
notification parameter, or storing it somewhere central in the DB that
they will all have to read anyway, i.e. the Stack). But when we update a
dependency we don't update the central graph, we update the individual
Resource so there's no global lock required.

[1]
https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168


A centralized graph and decision making will make the implementation far
more simpler than distributed.

The centralised graph is not in dispute; my approach has a centralisedgraph that gets passed down through oslo.messaging notifications, andI've already said I'm fine with putting that in the database instead. SoI don't know what you're arguing against, but it isn't what I've beensaying.

And the decision-making is only centralised if you send all thenotifications to a single engine. Maybe this is why I'm so confused,because I thought you had dropped that idea after Clint argued veryconvincingly against it.

This looks academic, but the simplicity
beats everything! When each worker has to decide, there needs to be
lock, only DB transactions are not enough.

It needs select-for-update and that is all. I don't see how either locksor transactions come into it.

In contrast, when the
decision making is centralized, that particular critical section can be
attempted with transaction and re-attempted if needed.

That makes no sense... if you need to re-attempt a transaction it canonly be because the decision is not centralised.

With the distributed approach, I see following drawbacks:

I have no idea what distributed approach you're referring to, or indeedwhat 'centralised' approach you're referring to.

1. Every time a resource is done, the peer resources (siblings) are
checked to see if they are done and the parent is propagated. This
happens for each resource.

As Clint mentioned, this is not correct. In your terminology, eachparent is checked to see if all of its children are done and if they arethen the notification is propagated.

2. The worker has to run through all the resources to see if the stack
is done, to mark it as completed.

Not true, the worker can send a notification to the Stack to indicatethat everything it was waiting for is done, in exactly the same way thatit can send notifications to the next Resource to say that everything itwas waiting for is done. I already implemented this:


https://github.com/zaneb/heat-convergence-prototype/commit/d2be14d2f711ece927cbf0f9382d744a3e881d4c

(It has been refactored a little since then.)

3. The decision to converge is made by each worker resulting in lot of
contention. The centralized graph restricts the contention point to one
place where we can use DB transactions. It is easier to maintain code
where particular decisions are made at a place rather than at many
places.

That's backwards. Where everything is fighting for the same centralisedpoint you have a lot of contention. Where only the things that canconflict ever meet at the same point, contention is reduced to theminimum possible.

Furthermore, just because it is happening different rows in the databasedoesn't mean it is happening in different parts of the code. e.g. in myimplementation it always happens in sync_point.py.

4. The complex part we are trying to solve is to decide on what to do
next when a resource is done. With centralized graph, this is abstracted
out to the DB API. The API will return the next set of nodes. A smart
SQL query can reduce a lot of logic currently being coded in
worker/engine.

I guess that was meant to sound like a good thing, but no part of thatsounded attractive to me.

Fancy queries are good for when you have a large amount of data of whichyou need to work with a subset. When you need all of the data then it'sbetter to work with it in code, and when you already know the primarykey of the data you need it's much much much better to just grab it.

And that's how my code works. At the start of the update, we grab all ofthe data and figure out what to do with it in code, build the graph, anduse the primary keys stored in the graph to traverse through it, at eachstep only grabbing the one row we need.

5. What would be the starting point for resource clean-up?

There's no fixed starting point, cleaning up is part of the graph and ithappens when things it depends on are complete, which is different foreach resource.

The clean-up
has to start when all the resources are updated.


That's wrong.

With no centralized
graph, the DB has to be searched for all the resources with no
dependencies and with older versions (or having older template keys) and
start removing them.

And we do this search by loading _all_ of the existing resources at thebeginning of the Stack update (which we have to do anyway!), calculatingthe graph and storing it centrally.

With centralized graph, this would be a simpler
with a SQL queries returning what needs to be done.


Nothing could be simpler than getting rows by key where you need them.

The search space for
where to start with clean-up will be huge.

It will be 0, because nodes to be cleaned up are nodes like any other inthe graph and we don't have to search for them, we have their IDs.

6. When engine restarts, the search space on where to start will be
huge. With a centralized graph, the abstracted API to get next set of
nodes makes the implementation of decision simpler.

If an engine dies I'm fine with starting again at the beginning. Or wecan just search for resources that are IN_PROGRESS with the engine_id ofthe engine that died.

I am convinced enough that it is simpler to assign the responsibility to
engine on what needs to be done next. No locks will be required, not
even resource locks! It is simpler from implementation, understanding
and maintenance perspective.

Again, this is just semantics. The only thing I'm proposing we use isselect-for-update to do atomic updates.

Also taking care of deleting resources in order will be an issue.


It works fine.

This implies that there will be different versions of a resource which
will even complicate further.


No it doesn't, other than the different versions we already have due to
UpdateReplace.

This approach reduces DB queries by waiting for completion
notification

on a topic. The drawback I see is that delete stack stream will be
huge as it will have the entire graph. We can always dump such data
in ResourceLock.data Json and pass a simple flag
"load_stream_from_db" to converge RPC call as a workaround for delete

operation.


This seems to be essentially equivalent to my 'SyncPoint'
proposal[1], with

the key difference that the data is stored in-memory in a Heat engine
rather than the database.


I suspect it's probably a mistake to move it in-memory for similar
reasons to the argument Clint made against synchronising the marking
off

of dependencies in-memory. The database can handle that and the
problem of making the DB robust against failures of a single machine
has already been solved by someone else. If we do it in-memory we are
just creating a single point of failure for not much gain. (I guess
you could argue it doesn't matter, since if any Heat engine dies
during the traversal then we'll have to kick off another one anyway,
but it does limit our options if that changes in the
future.) [Murugan, Visnusaran] Resource completes, removes itself
from resource_lock and notifies engine. Engine will acquire parent
lock and initiate parent only if all its children are satisfied (no child entry 
in

resource_lock).

This will come in place of Aggregator.

Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I

did.

The three differences I can see are:

1) I think you are proposing to create all of the sync points at the
start of the traversal, rather than on an as-needed basis. This is
probably a good idea. I didn't consider it because of the way my
prototype evolved, but there's now no reason I can see not to do this.
If we could move the data to the Resource table itself then we could
even get it for free from an efficiency point of view.


+1. But we will need engine_id to be stored somewhere for recovery

purpose (easy to be queried format).

Yeah, so I'm starting to think you're right, maybe the/a Lock table is the right
thing to use there. We could probably do it within the resource table using
the same select-for-update to set the engine_id, but I agree that we might
be starting to jam too much into that one table.


yeah. Unrelated values in resource table. Upon resource completion we have to
unset engine_id as well as compared to dropping a row from resource lock.
Both are good. Having engine_id in resource_table will reduce db operaions
in half. We should go with just resource table along with engine_id.

OK

Sync points are created as-needed. Single resource is enough to restart

that entire stream.

I think there is a disconnect in our understanding. I will detail it as well in

the etherpad.

OK, that would be good.

2) You're using a single list from which items are removed, rather
than two lists (one static, and one to which items are added) that get

compared.

Assuming (1) then this is probably a good idea too.


Yeah. We have a single list per active stream which work by removing
Complete/satisfied resources from it.


I went to change this and then remembered why I did it this way: the sync
point is also storing data about the resources that are triggering it. Part of 
this
is the RefID and attributes, and we could replace that by storing that data in
the Resource itself and querying it rather than having it passed in via the
notification. But the other part is the ID/key of those resources, which we
_need_ to know in order to update the requirements in case one of them
has been replaced and thus the graph doesn't reflect it yet. (Or, for that
matter, we need it to know where to go looking for the RefId and/or
attributes if they're in the
DB.) So we have to store some data, we can't just remove items from the
required list (although we could do that as well).

3) You're suggesting to notify the engine unconditionally and let the
engine decide if the list is empty. That's probably not a good idea -
not only does it require extra reads, it introduces a race condition
that you then have to solve (it can be solved, it's just more work).
Since the update to remove a child from the list is atomic, it's best
to just trigger the engine only if the list is now empty.


No. Notify only if stream has something to be processed. The newer
Approach based on db lock will be that the last resource will initiate its

parent.

This is opposite to what our Aggregator model had suggested.


OK, I think we're on the same page on this one then.



Yeah.

It's not clear to me how the 'streams' differ in practical terms
from just passing a serialisation of the Dependencies object, other
than being incomprehensible to me ;). The current Dependencies
implementation
(1) is a very generic implementation of a DAG, (2) works and has
plenty of

unit tests, (3) has, with I think one exception, a pretty
straightforward API,
(4) has a very simple serialisation, returned by the edges() method,
which can be passed back into the constructor to recreate it, and (5)
has an API that is to some extent relied upon by resources, and so
won't likely be removed outright in any event.

Whatever code we need to handle dependencies ought to just build on

this existing implementation.

[Murugan, Visnusaran] Our thought was to reduce payload size

(template/graph). Just planning for worst case scenario (million
resource
stack) We could always dump them in ResourceLock.data to be loaded by
Worker.


With the latest updates to the Etherpad, I'm even more confused by
streams than I was before.

One thing I never understood is why do you need to store the whole path
to reach each node in the graph? Surely you only need to know the nodes
this one is waiting on, the nodes waiting on this one and the ones those
are waiting on, not the entire history up to this point. The size of
each stream is theoretically up to O(n^2) and you're storing n of them -
that's going to get painful in this million-resource stack.

If there's a smaller representation of a graph than a list of edges
then I don't know what it is. The proposed stream structure certainly
isn't it, unless you mean as an alternative to storing the entire
graph once for each resource. A better alternative is to store it
once centrally - in my current implementation it is passed down
through the trigger messages, but since only one traversal can be in
progress at a time it could just as easily be stored in the Stack table of the

database at the slight cost of an extra write.


Agree that edge is the smallest representation of a graph. But it does
not give us a complete picture without doing a DB lookup. Our
assumption was to store streams in IN_PROGRESS resource_lock.data
column. This could be in resource table instead.


That's true, but I think in practice at any point where we need to look at this
we will always have already loaded the Stack from the DB for some other
reason, so we actually can get it for free. (See detailed discussion in my reply
to Anant.)


Aren't we planning to stop loading stack with all resource objects in future to
Address scalability concerns we currently have?


We plan on not loading all of the Resource objects each time we load the
Stack object, but I think we will always need to have loaded the Stack
object (for example, we'll need to check the current traversal ID,
amongst other reasons). So if the serialised dependency graph is stored
in the Stack it will be no big deal.

I'm not opposed to doing that, BTW. In fact, I'm really interested in
your input on how that might help make recovery from failure more
robust. I know Anant mentioned that not storing enough data to
recover when a node dies was his big concern with my current approach.


With streams, We feel recovery will be easier. All we need is a
trigger :)

I can see that by both creating all the sync points at the start of
the traversal and storing the dependency graph in the database
instead of letting it flow through the RPC messages, we would be able
to resume a traversal where it left off, though I'm not sure what that buys

us.


And I guess what you're suggesting is that by having an explicit lock
with the engine ID specified, we can detect when a resource is stuck
in IN_PROGRESS due to an engine going down? That's actually pretty

interesting.


Yeah :)

Based on our call on Thursday, I think you're taking the idea of the
Lock

table too literally. The point of referring to locks is that we can
use the same concepts as the Lock table relies on to do atomic
updates on a particular row of the database, and we can use those
atomic updates to prevent race conditions when implementing
SyncPoints/Aggregators/whatever you want to call them. It's not that
we'd actually use the Lock table itself, which implements a mutex and
therefore offers only a much slower and more stateful way of doing
what we want (lock mutex, change data, unlock mutex).

[Murugan, Visnusaran] Are you suggesting something like a
select-for-

update in resource table itself without having  a lock table?

Yes, that's exactly what I was suggesting.


DB is always good for sync. But we need to be careful not to overdo it.


Yeah, I see what you mean now, it's starting to _feel_ like there'd be too
many things mixed together in the Resource table. Are you aware of some
concrete harm that might cause though? What happens if we overdo it? Is
select-for-update on a huge row more expensive than the whole overhead
of manipulating the Lock?

Just trying to figure out if intuition is leading me astray here.


You are right. There should be no difference apart from little bump
In memory usage. But I think it should be fine.

Will update etherpad by tomorrow.


OK, thanks.

cheers,
Zane.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

Reply via email to