On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
> On 14/12/13 16:51, Jay Pipes wrote:

> [snip]

>> Instead of focusing on locking issues -- which I agree are very
>> important in the virtualized side of things where resources are
>> "thinner" -- I believe that in the bare-metal world, a more useful focus
>> would be to ensure that the Tuskar API service treats related group
>> operations (like "deploy an undercloud on these nodes") in a way that
>> can handle failures in a graceful and/or atomic way.

> Atomicity of operations can be achieved by introducing critical
> sections. You basically have two ways of doing that: optimistic and
> pessimistic. A pessimistic critical section is implemented with a
> locking mechanism that prevents all other processes from entering the
> critical section until it is finished.
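For readers less familiar with the distinction, the two approaches can be sketched in a few lines of Python. This is a minimal single-process illustration only (the counter classes are hypothetical, not anything in Tuskar); in the pessimistic case a lock excludes all other writers for the duration of the critical section, while in the optimistic case work happens outside any lock and the commit succeeds only if nobody else changed the record in the meantime:

```python
import threading

class PessimisticCounter:
    """Pessimistic: hold a lock for the whole critical section."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # competing threads block here until we finish
            self.value += 1

class OptimisticCounter:
    """Optimistic: snapshot a version, work unlocked, commit only if
    the version is unchanged; otherwise another writer won -- retry."""
    def __init__(self):
        self.value = 0
        self._version = 0
        # Stands in for the datastore's atomic compare-and-swap primitive
        self._commit = threading.Lock()

    def increment(self):
        while True:
            snapshot = self._version
            new_value = self.value + 1   # computed outside any critical section
            with self._commit:           # only the commit itself is atomic
                if self._version == snapshot:
                    self.value = new_value
                    self._version += 1
                    return
            # version moved on: a concurrent writer committed first; retry
```

The optimistic variant never blocks other workers while the "real work" happens; it pays for that with retries under contention, which is usually the right trade when conflicts are rare.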

I'm familiar with the traditional non-distributed software concept of a mutex (or in Windows world, a critical section). But we aren't dealing with traditional non-distributed software here. We're dealing with highly distributed software where components involved in the "transaction" may not be running on the same host or have much awareness of each other at all.

And, in any case (see below), I don't think that this is a problem that needs to be solved in Tuskar.

> Perhaps you have some other way of making them atomic that I can't
> think of?

I should not have used the term atomic above. I actually do not think that the things that Tuskar/Ironic does should be viewed as an atomic operation. More below.

>> For example, if the construction or installation of one compute worker
>> failed, adding some retry or retry-after-wait-for-event logic would be
>> more useful than trying to put locks in a bunch of places to prevent
>> multiple sysadmins from trying to deploy on the same bare-metal nodes
>> (since it's just not gonna happen in the real world, and IMO, if it did
>> happen, the sysadmins/deployers should be punished and have to clean up
>> their own mess ;)

> I don't see why they should be punished, if the UI was assuring them
> that they are doing exactly the thing that they wanted to do, at every
> step, and in the end it did something completely different, without any
> warning. If anyone deserves punishment in such a situation, it's the
> programmers who wrote the UI in such a way.

The issue I am getting at is that, in the real world, the problem of multiple users of Tuskar attempting to deploy an undercloud on the exact same set of bare-metal machines is just not going to happen. If you think this is actually a real-world problem, and have seen two sysadmins actively trying to deploy an undercloud on bare-metal machines at the same time, unbeknownst to each other, then I feel bad for the sysadmins who found themselves in such a situation, but I feel it's their own fault for not knowing what the other was doing.

Trying to make a complex series of related but distributed actions -- like the underlying actions of the Tuskar -> Ironic API calls -- into an atomic operation is just not a good use of programming effort, IMO. Instead, I'm advocating that programming effort should instead be spent coding a workflow/taskflow pipeline that can gracefully retry failed operations and report the state of the total taskflow back to the user.
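To make the retry-and-report idea concrete, here is a rough sketch of that style in Python. To be clear, this is a hypothetical illustration, not the actual taskflow library API or Tuskar code; the task names and the `run_with_retries`/`deploy_pipeline` helpers are invented for the example:

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.1):
    """Run a callable, retrying with exponential backoff on failure.

    Re-raises the last error once the retry budget is exhausted, so the
    caller is left with a clear failure to report -- not a stale lock.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off, then retry

def deploy_pipeline(tasks):
    """Run (name, task) pairs in order, recording per-task state so the
    state of the whole flow can be reported back to the user."""
    states = {}
    for name, task in tasks:
        try:
            run_with_retries(task)
            states[name] = "SUCCESS"
        except Exception as exc:
            states[name] = "FAILED: %s" % exc
            break  # later steps depend on earlier ones; stop and report
    return states
```

The point is that transient failures (a node that is slow to PXE-boot, a flaky power driver) get absorbed by the retry logic, and a genuine failure surfaces as an explicit per-step state the operator can act on, rather than a partially-held distributed lock that someone has to clean up.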

Hope that makes more sense,
-jay

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
