Hi Zane,

We have been going through this chain for quite some time now and we still feel a disconnect in our understanding. Can you put up an etherpad where we can understand your approach? For example: for storing resource dependencies, are you storing a (name, version) tuple or just the resource's ID? If we understand correctly, you are updating all resources on an update regardless of whether they changed, which will be inefficient if a stack contains a million resources. We have similar questions about other areas of your implementation, most of which we believe will be resolved once we understand its outline. It is difficult to get a hold on your approach just by looking at the code; docstrings or an etherpad will help.
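To make the first question concrete, here is a tiny sketch of the two alternatives we have in mind (field names and shapes are ours, purely illustrative, not taken from either prototype):

    # Option A: a resource's requirements stored as opaque resource IDs only.
    requires_by_id = {
        'server_1': ['port_1', 'volume_1'],
    }

    # Option B: requirements stored as (name, version) tuples, so an update
    # can tell which concrete version of each neighbour a resource was
    # built against.
    requires_by_name_version = {
        'server_1': [('port_1', 3), ('volume_1', 1)],
    }

Knowing which of these (or something else entirely) you store would answer most of our questions about how unchanged resources are treated on update.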
About streams: yes, in a million-resource stack the data will be huge, but still smaller than the template. Also, this stream is stored only in IN_PROGRESS resources. The reason for having the entire dependency list is to reduce DB queries during a stack update. When each resource holds only its own dependency entry, as in your implementation, we end up loading dependencies one at a time and altering almost every resource's dependencies regardless of whether they changed.

Regarding the 2-template approach for delete: it is not actually 2 different templates. It is just that we have a delete stream to be taken up post-update (any post operation will be handled as an update). This applies when Rollback==True; we can always fall back to the regular (non-delete) stream if Rollback==False.

In our view we would like to have only one basic operation, and that is UPDATE:
1. CREATE will be an update where the realized graph == empty.
2. UPDATE will be an update where the realized graph == fully/partially realized (possibly with a delete stream as a post operation if Rollback==True).
3. DELETE will be just another update with an empty to_be_realized_graph.

It would be great if we could freeze a stable approach by mid-week, as the Christmas vacation is round the corner. :) :)

> -----Original Message----- > From: Zane Bitter [mailto:[email protected]] > Sent: Saturday, December 13, 2014 5:43 AM > To: [email protected] > Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept > showdown > > On 12/12/14 05:29, Murugan, Visnusaran wrote: > > > > > >> -----Original Message----- > >> From: Zane Bitter [mailto:[email protected]] > >> Sent: Friday, December 12, 2014 6:37 AM > >> To: [email protected] > >> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept > >> showdown > >> > >> On 11/12/14 08:26, Murugan, Visnusaran wrote: > >>>>> [Murugan, Visnusaran] > >>>>> In case of rollback where we have to cleanup earlier version of > >>>>> resources, > >>>> we could get the order from old template. We'd prefer not to have a > >>>> graph table. > >>>> > >>>> In theory you could get it by keeping old templates around. But > >>>> that means keeping a lot of templates, and it will be hard to keep > >>>> track of when you want to delete them. It also means that when > >>>> starting an update you'll need to load every existing previous > >>>> version of the template in order to calculate the dependencies. It > >>>> also leaves the dependencies in an ambiguous state when a resource > >>>> fails, and although that can be worked around it will be a giant pain to > implement. > >>>> > >>> > >>> Agree that looking to all templates for a delete is not good. But > >>> baring Complexity, we feel we could achieve it by way of having an > >>> update and a delete stream for a stack update operation. I will > >>> elaborate in detail in the etherpad sometime tomorrow :) > >>> > >>>> I agree that I'd prefer not to have a graph table. After trying a > >>>> couple of different things I decided to store the dependencies in > >>>> the Resource table, where we can read or write them virtually for > >>>> free because it turns out that we are always reading or updating > >>>> the Resource itself at exactly the same time anyway. > >>>> > >>> > >>> Not sure how this will work in an update scenario when a resource > >>> does not change and its dependencies do. > >> > >> We'll always update the requirements, even when the properties don't > >> change. > >> > > > > Can you elaborate a bit on rollback. > > I didn't do anything special to handle rollback.
It's possible that we need > to - > obviously the difference in the UpdateReplace + rollback case is that the > replaced resource is now the one we want to keep, and yet the > replaced_by/replaces dependency will force the newer (replacement) > resource to be checked for deletion first, which is an inversion of the usual > order. > > However, I tried to think of a scenario where that would cause problems and > I couldn't come up with one. Provided we know the actual, real-world > dependencies of each resource I don't think the ordering of those two > checks matters. > > In fact, I currently can't think of a case where the dependency order > between replacement and replaced resources matters at all. It matters in the > current Heat implementation because resources are artificially segmented > into the current and backup stacks, but with a holistic view of dependencies > that may well not be required. I tried taking that line out of the simulator > code and all the tests still passed. If anybody can think of a scenario in > which > it would make a difference, I would be very interested to hear it. > > In any event though, it should be no problem to reverse the direction of that > one edge in these particular circumstances if it does turn out to be a > problem. > > > We had an approach with depends_on > > and needed_by columns in ResourceTable. But dropped it when we > figured > > out we had too many DB operations for Update. > > Yeah, I initially ran into this problem too - you have a bunch of nodes that > are > waiting on the current node, and now you have to go look them all up in the > database to see what else they're waiting on in order to tell if they're ready > to be triggered. > > It turns out the answer is to distribute the writes but centralise the reads. > So > at the start of the update, we read all of the Resources, obtain their > dependencies and build one central graph[1]. We than make that graph > available to each resource (either by passing it as a notification parameter, > or > storing it somewhere central in the DB that they will all have to read anyway, > i.e. the Stack). But when we update a dependency we don't update the > central graph, we update the individual Resource so there's no global lock > required. > > [1] > https://github.com/zaneb/heat-convergence-prototype/blob/distributed- > graph/converge/stack.py#L166-L168 > > >>> Also taking care of deleting resources in order will be an issue. > >> > >> It works fine. > >> > >>> This implies that there will be different versions of a resource > >>> which will even complicate further. > >> > >> No it doesn't, other than the different versions we already have due > >> to UpdateReplace. > >> > >>>>>> This approach reduces DB queries by waiting for completion > >>>>>> notification > >>>> on a topic. The drawback I see is that delete stack stream will be > >>>> huge as it will have the entire graph. We can always dump such data > >>>> in ResourceLock.data Json and pass a simple flag > >>>> "load_stream_from_db" to converge RPC call as a workaround for > >>>> delete > >> operation. > >>>>> > >>>>> This seems to be essentially equivalent to my 'SyncPoint' > >>>>> proposal[1], with > >>>> the key difference that the data is stored in-memory in a Heat > >>>> engine rather than the database. > >>>>> > >>>>> I suspect it's probably a mistake to move it in-memory for similar > >>>>> reasons to the argument Clint made against synchronising the > >>>>> marking off > >>>> of dependencies in-memory. 
The database can handle that and the > >>>> problem of making the DB robust against failures of a single > >>>> machine has already been solved by someone else. If we do it > >>>> in-memory we are just creating a single point of failure for not > >>>> much gain. (I guess you could argue it doesn't matter, since if any > >>>> Heat engine dies during the traversal then we'll have to kick off > >>>> another one anyway, but it does limit our options if that changes > >>>> in the > >>>> future.) [Murugan, Visnusaran] Resource completes, removes itself > >>>> from resource_lock and notifies engine. Engine will acquire parent > >>>> lock and initiate parent only if all its children are satisfied (no > >>>> child entry in > >> resource_lock). > >>>> This will come in place of Aggregator. > >>>> > >>>> Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly > >>>> what I > >> did. > >>>> The three differences I can see are: > >>>> > >>>> 1) I think you are proposing to create all of the sync points at > >>>> the start of the traversal, rather than on an as-needed basis. This > >>>> is probably a good idea. I didn't consider it because of the way my > >>>> prototype evolved, but there's now no reason I can see not to do this. > >>>> If we could move the data to the Resource table itself then we > >>>> could even get it for free from an efficiency point of view. > >>> > >>> +1. But we will need engine_id to be stored somewhere for recovery > >> purpose (easy to be queried format). > >> > >> Yeah, so I'm starting to think you're right, maybe the/a Lock table > >> is the right thing to use there. We could probably do it within the > >> resource table using the same select-for-update to set the engine_id, > >> but I agree that we might be starting to jam too much into that one table. > >> > > > > yeah. Unrelated values in resource table. Upon resource completion we > > have to unset engine_id as well as compared to dropping a row from > resource lock. > > Both are good. Having engine_id in resource_table will reduce db > > operaions in half. We should go with just resource table along with > engine_id. > > OK > > >>> Sync points are created as-needed. Single resource is enough to > >>> restart > >> that entire stream. > >>> I think there is a disconnect in our understanding. I will detail it > >>> as well in > >> the etherpad. > >> > >> OK, that would be good. > >> > >>>> 2) You're using a single list from which items are removed, rather > >>>> than two lists (one static, and one to which items are added) that > >>>> get > >> compared. > >>>> Assuming (1) then this is probably a good idea too. > >>> > >>> Yeah. We have a single list per active stream which work by removing > >>> Complete/satisfied resources from it. > >> > >> I went to change this and then remembered why I did it this way: the > >> sync point is also storing data about the resources that are > >> triggering it. Part of this is the RefID and attributes, and we could > >> replace that by storing that data in the Resource itself and querying > >> it rather than having it passed in via the notification. But the > >> other part is the ID/key of those resources, which we _need_ to know > >> in order to update the requirements in case one of them has been > >> replaced and thus the graph doesn't reflect it yet. (Or, for that > >> matter, we need it to know where to go looking for the RefId and/or > >> attributes if they're in the > >> DB.) 
So we have to store some data, we can't just remove items from > >> the required list (although we could do that as well). > >> > >>>> 3) You're suggesting to notify the engine unconditionally and let > >>>> the engine decide if the list is empty. That's probably not a good > >>>> idea - not only does it require extra reads, it introduces a race > >>>> condition that you then have to solve (it can be solved, it's just more > work). > >>>> Since the update to remove a child from the list is atomic, it's > >>>> best to just trigger the engine only if the list is now empty. > >>>> > >>> > >>> No. Notify only if stream has something to be processed. The newer > >>> Approach based on db lock will be that the last resource will > >>> initiate its > >> parent. > >>> This is opposite to what our Aggregator model had suggested. > >> > >> OK, I think we're on the same page on this one then. > >> > > > > > > Yeah. > > > >>>>> It's not clear to me how the 'streams' differ in practical terms > >>>>> from just passing a serialisation of the Dependencies object, > >>>>> other than being incomprehensible to me ;). The current > >>>>> Dependencies implementation > >>>>> (1) is a very generic implementation of a DAG, (2) works and has > >>>>> plenty of > >>>> unit tests, (3) has, with I think one exception, a pretty > >>>> straightforward API, > >>>> (4) has a very simple serialisation, returned by the edges() > >>>> method, which can be passed back into the constructor to recreate > >>>> it, and (5) has an API that is to some extent relied upon by > >>>> resources, and so won't likely be removed outright in any event. > >>>>> Whatever code we need to handle dependencies ought to just build > >>>>> on > >>>> this existing implementation. > >>>>> [Murugan, Visnusaran] Our thought was to reduce payload size > >>>> (template/graph). Just planning for worst case scenario (million > >>>> resource > >>>> stack) We could always dump them in ResourceLock.data to be loaded > >>>> by Worker. > > With the latest updates to the Etherpad, I'm even more confused by streams > than I was before. > > One thing I never understood is why do you need to store the whole path to > reach each node in the graph? Surely you only need to know the nodes this > one is waiting on, the nodes waiting on this one and the ones those are > waiting on, not the entire history up to this point. The size of each stream > is > theoretically up to O(n^2) and you're storing n of them - that's going to get > painful in this million-resource stack. > > >>>> If there's a smaller representation of a graph than a list of edges > >>>> then I don't know what it is. The proposed stream structure > >>>> certainly isn't it, unless you mean as an alternative to storing > >>>> the entire graph once for each resource. A better alternative is to > >>>> store it once centrally - in my current implementation it is passed > >>>> down through the trigger messages, but since only one traversal can > >>>> be in progress at a time it could just as easily be stored in the > >>>> Stack table of the > >> database at the slight cost of an extra write. > >>>> > >>> > >>> Agree that edge is the smallest representation of a graph. But it > >>> does not give us a complete picture without doing a DB lookup. Our > >>> assumption was to store streams in IN_PROGRESS resource_lock.data > >>> column. This could be in resource table instead. 
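[Murugan, Visnusaran] To make sure we mean the same thing by "a list of edges stored once centrally", here is a rough sketch of the round-trip we understand you to be describing (a toy class, not the actual Dependencies implementation or its API):

    # Toy example: serialise the dependency graph as a list of edges and
    # store it once (e.g. on the Stack row) instead of one stream per
    # resource. Names are illustrative only.
    import json

    class Graph(object):
        def __init__(self, edges):
            # edges: iterable of (requirer, required) node-name pairs
            self.edges = [tuple(e) for e in edges]

        def serialise(self):
            return json.dumps(self.edges)

        @classmethod
        def deserialise(cls, data):
            return cls(json.loads(data))

        def requirers_of(self, node):
            return [a for a, b in self.edges if b == node]

    graph = Graph([('server', 'port'), ('server', 'volume')])
    stored = graph.serialise()            # written once per traversal
    restored = Graph.deserialise(stored)  # rebuilt by any worker that needs it
    assert restored.requirers_of('port') == ['server']

If that is the model, storage stays proportional to the number of edges for the whole stack, which we agree is smaller than n per-resource streams that each repeat their full path.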
> >> > >> That's true, but I think in practice at any point where we need to > >> look at this we will always have already loaded the Stack from the DB > >> for some other reason, so we actually can get it for free. (See > >> detailed discussion in my reply to Anant.) > >> > > > > Aren't we planning to stop loading stack with all resource objects in > > future to Address scalability concerns we currently have? > > We plan on not loading all of the Resource objects each time we load the > Stack object, but I think we will always need to have loaded the Stack object > (for example, we'll need to check the current traversal ID, amongst other > reasons). So if the serialised dependency graph is stored in the Stack it > will be > no big deal. > > >>>> I'm not opposed to doing that, BTW. In fact, I'm really interested > >>>> in your input on how that might help make recovery from failure > >>>> more robust. I know Anant mentioned that not storing enough data to > >>>> recover when a node dies was his big concern with my current > approach. > >>>> > >>> > >>> With streams, We feel recovery will be easier. All we need is a > >>> trigger :) > >>> > >>>> I can see that by both creating all the sync points at the start of > >>>> the traversal and storing the dependency graph in the database > >>>> instead of letting it flow through the RPC messages, we would be > >>>> able to resume a traversal where it left off, though I'm not sure > >>>> what that buys > >> us. > >>>> > >>>> And I guess what you're suggesting is that by having an explicit > >>>> lock with the engine ID specified, we can detect when a resource is > >>>> stuck in IN_PROGRESS due to an engine going down? That's actually > >>>> pretty > >> interesting. > >>>> > >>> > >>> Yeah :) > >>> > >>>>> Based on our call on Thursday, I think you're taking the idea of > >>>>> the Lock > >>>> table too literally. The point of referring to locks is that we can > >>>> use the same concepts as the Lock table relies on to do atomic > >>>> updates on a particular row of the database, and we can use those > >>>> atomic updates to prevent race conditions when implementing > >>>> SyncPoints/Aggregators/whatever you want to call them. It's not > >>>> that we'd actually use the Lock table itself, which implements a > >>>> mutex and therefore offers only a much slower and more stateful way > >>>> of doing what we want (lock mutex, change data, unlock mutex). > >>>>> [Murugan, Visnusaran] Are you suggesting something like a > >>>>> select-for- > >>>> update in resource table itself without having a lock table? > >>>> > >>>> Yes, that's exactly what I was suggesting. > >>> > >>> DB is always good for sync. But we need to be careful not to overdo it. > >> > >> Yeah, I see what you mean now, it's starting to _feel_ like there'd > >> be too many things mixed together in the Resource table. Are you > >> aware of some concrete harm that might cause though? What happens if > >> we overdo it? Is select-for-update on a huge row more expensive than > >> the whole overhead of manipulating the Lock? > >> > >> Just trying to figure out if intuition is leading me astray here. > >> > > > > You are right. There should be no difference apart from little bump In > > memory usage. But I think it should be fine. > > > >>> Will update etherpad by tomorrow. > >> > >> OK, thanks. > >> > >> cheers, > >> Zane. 
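[Murugan, Visnusaran] For reference, the atomic engine_id claim on the resource row that we both seem to be converging on would look roughly like this (a SQLAlchemy-flavoured sketch with made-up model and session names, not the real Heat DB API):

    # Claim a resource row by setting engine_id only if nobody else holds it.
    # A single conditional UPDATE gives the same guarantee as the Lock table,
    # without a separate mutex; the rowcount tells us whether we won the race.
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = 'resource'
        id = Column(Integer, primary_key=True)
        engine_id = Column(String, nullable=True)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    def try_claim(session, resource_id, engine_id):
        claimed = (session.query(Resource)
                   .filter_by(id=resource_id, engine_id=None)
                   .update({'engine_id': engine_id},
                           synchronize_session=False))
        session.commit()
        return claimed == 1

    s = Session()
    s.add(Resource(id=1))
    s.commit()
    assert try_claim(s, 1, 'engine-A') is True    # first engine wins the row
    assert try_claim(s, 1, 'engine-B') is False   # row already held

A stale engine_id left behind by a dead engine is then easy to detect and clean up on the next traversal, which is the recovery property we are after.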
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
