On 20/12/13 00:17, Jay Pipes wrote:
> On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
>> On 14/12/13 16:51, Jay Pipes wrote:
>>
>> [snip]
>>
>>> Instead of focusing on locking issues -- which I agree are very
>>> important in the virtualized side of things where resources are
>>> "thinner" -- I believe that in the bare-metal world, a more useful
>>> focus would be to ensure that the Tuskar API service treats related
>>> group operations (like "deploy an undercloud on these nodes") in a
>>> way that can handle failures in a graceful and/or atomic way.
>>
>> Atomicity of operations can be achieved by introducing critical
>> sections. You basically have two ways of doing that, optimistic and
>> pessimistic. A pessimistic critical section is implemented with a
>> locking mechanism that prevents all other processes from entering the
>> critical section until it is finished.
>
> I'm familiar with the traditional non-distributed software concept of a
> mutex (or, in the Windows world, a critical section). But we aren't
> dealing with traditional non-distributed software here. We're dealing
> with highly distributed software where components involved in the
> "transaction" may not be running on the same host or have much
> awareness of each other at all.
Yes, that is precisely why you need a single point where they can check
that they are not stepping on each other's toes. If you don't, you get
race conditions and non-deterministic behavior. The only difference from
traditional, non-distributed software is that, since the components
involved communicate over a relatively slow network, the chance of an
actual conflict is much, much greater. Scaling the whole thing to
hundreds of nodes practically guarantees trouble.

> And, in any case (see below), I don't think that this is a problem
> that needs to be solved in Tuskar.
>
>> Perhaps you have some other way of making them atomic that I can't
>> think of?
>
> I should not have used the term atomic above. I actually do not think
> that the things that Tuskar/Ironic does should be viewed as an atomic
> operation. More below.

OK, no operations performed by Tuskar need to be atomic, noted.

>>> For example, if the construction or installation of one compute
>>> worker failed, adding some retry or retry-after-wait-for-event logic
>>> would be more useful than trying to put locks in a bunch of places to
>>> prevent multiple sysadmins from trying to deploy on the same
>>> bare-metal nodes (since it's just not gonna happen in the real world,
>>> and IMO, if it did happen, the sysadmins/deployers should be punished
>>> and have to clean up their own mess ;)
>>
>> I don't see why they should be punished, if the UI was assuring them
>> that they were doing exactly the thing that they wanted to do, at
>> every step, and in the end it did something completely different,
>> without any warning. If anyone deserves punishment in such a
>> situation, it's the programmers who wrote the UI that way.
>
> The issue I am getting at is that, in the real world, the problem of
> multiple users of Tuskar attempting to deploy an undercloud on the
> exact same set of bare metal machines is just not going to happen.
> If you think this is actually a real-world problem, and have seen two
> sysadmins actively trying to deploy an undercloud on bare-metal
> machines at the same time, unbeknownst to each other, then I feel bad
> for the sysadmins that found themselves in such a situation, but I
> feel it's their own fault for not knowing about what the other was
> doing.

How can it be their fault, when at every step of their interaction with
the user interface, the user interface was assuring them that they were
going to do the right thing (deploy a certain set of nodes), but when
they finally hit the confirmation button, it did a completely different
thing (deployed a different set of nodes)? The only fault I see is in
them using such software. Or are you suggesting that they should
implement the lock themselves, through e-mails or some other means of
communication?

Don't get me wrong, the deploy button is just one easy example of this
problem. We have it all over the user interface. Even such a simple
operation as retrieving a list of node ids, and then displaying the
corresponding information to the user, has a race condition in it --
what if some of the nodes get deleted after we get the list of ids, but
before we make the call to get node details about them? This should be
done as an atomic operation that either locks, or fails if there was a
change in the middle of it, and since the calls are to different
systems, the only place where you can set a lock or check for a change
is the tuskar-api. And no, trying again to get the information about a
deleted node won't help -- you can keep retrying for years, and the
node will still remain deleted. This is all over the place. And saying
that "this is the user's fault" doesn't help.

> Trying to make a complex series of related but distributed actions --
> like the underlying actions of the Tuskar -> Ironic API calls -- into
> an atomic operation is just not a good use of programming effort, IMO.
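To make the list-then-fetch race concrete, here is a sketch of a
consistent read that fails (rather than silently showing the user a
half-deleted view) when the node set changes mid-read. The client
methods (`get_change_token`, `list_node_ids`, `get_node`) are
hypothetical stand-ins, not real Tuskar or Ironic calls; the idea is
that a single coordination point hands out a change token (a version or
etag) that lets the caller detect concurrent modification:

```python
class StaleRead(Exception):
    """The node set kept changing; no consistent snapshot was possible."""

def list_nodes_consistently(client, max_tries=5):
    """Fetch node ids, then their details, and return them only if the
    node set did not change in between; otherwise retry, then fail."""
    for _ in range(max_tries):
        token = client.get_change_token()   # snapshot marker
        ids = client.list_node_ids()
        try:
            details = [client.get_node(i) for i in ids]
        except KeyError:                    # a node vanished mid-read
            continue
        if client.get_change_token() == token:
            return details                  # consistent snapshot
    raise StaleRead("gave up after %d tries" % max_tries)
```

Note that the failure mode is explicit: the UI can tell the user "the
node list changed, please refresh", instead of rendering stale data.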
> Instead, I'm advocating that programming effort should instead be
> spent coding a workflow/taskflow pipeline that can gracefully retry
> failed operations and report the state of the total taskflow back to
> the user.

Sure, there are many ways to solve any particular synchronisation
problem. Let's say that we have one that can actually be solved by
retrying. Do you want to retry infinitely? Would you like to increase
the delays between retries exponentially? If so, where are you going to
keep the shared counters for the retries? Perhaps in tuskar-api, hmm?

Or are you just saying that we should pretend that the nondeterministic
bugs appearing due to the lack of synchronization simply don't exist?
They cannot be easily reproduced, after all. We could just close our
eyes, cover our ears, sing "lalalala" and close any bug reports with
such errors with "could not reproduce on my single-user, single-machine
development installation". I know that a lot of software companies do
exactly that, so I guess it's a valid business practice; I just want to
make sure that this is actually the tactic we are going to take, before
committing to an architectural decision that will make those bugs
impossible to fix.

-- 
Radomir Dopieralski

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
