On Aug 26, 2011, at 2:22 PM, Ed Leafe wrote:

>       Sorry I haven't come up with a snazzy name for it yet, but what I have 
> in mind is a new service that is essential for my employer (Rackspace), and 
> might be important for other OpenStack deployments. This new service would be 
> completely optional, of course - only those for whom it is relevant would run 
> it.
> 
>       Let me start by stating the problem: when a customer requests that we 
> create instances for them, nova casts those requests into the queue, where 
> they are eventually acted upon. That usually works great, but in cases where 
> the instance creation fails, we need to detect that failure and re-issue the 
> create request with a different host. This is currently not possible with the 
> asynchronous design of the compute-scheduler interactions.
> 
>       So what I envision is a service that scans a list of recent requests' 
> reservation IDs, and follows up to see if the request was successful or not, 
> and takes action if needed. The blueprint for this can be found at 
> https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with 
> an Etherpad created for ongoing idea exchange at 
> http://etherpad.openstack.org/instance-creation-assurance

Hmmm... Having looked over this, I agree that we need a way to retry 
failed builds. However, I do not think that having another service essentially 
polling the builds to find failures is the right way to go. 

First off, I think it would be better if whatever had the failure responded by 
sending a request somewhere (a cast) to say "Hey, this bombed. Retry it."  I 
wouldn't try doing calls instead of casts, as some have suggested, for 
performance reasons (and I could see deadlocking issues). 
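
To make the "cast on failure" idea concrete, here is a minimal sketch in 
Python. It is not nova code: the in-process queue stands in for the message 
queue, and cast(), build_instance() and the message fields are names made up 
for illustration. 

```python
import queue

# Stand-in for the message queue; a real deployment would use the rpc layer.
retry_queue = queue.Queue()


def cast(q, message):
    """Fire and forget: enqueue the message and return without waiting."""
    q.put_nowait(message)


def build_instance(host, request_id):
    """Hypothetical build step that always fails, to show the failure path."""
    try:
        raise RuntimeError("hypervisor out of memory")
    except RuntimeError as exc:
        # The component that saw the failure reports it; nothing polls for it.
        cast(retry_queue, {"request_id": request_id,
                           "failed_host": host,
                           "reason": str(exc)})


build_instance("host-a", "req-1")
```

Whatever consumes retry_queue can then decide where to retry, without the 
failing component ever blocking on a reply. 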

If we step back and look at this, these requests/orders/whatever you call them 
amount to multi-step workflows.  Even for building a single server you have 
things like "allocate this instance on a hypervisor", "assign IPs", "attach 
these volumes", any of which could fail for some reason.  And if they do 
fail, there may be steps needed to back out already completed work. 

The proper way, IMHO, for this to work is that a request generates a workorder 
with a set of tasks.  
This gets sent to something (scheduler, probably) which looks at the first 
uncompleted task on the workorder, makes the decision on where to send it, and 
routes the whole workorder there.  
The service that gets it performs the task (i.e. executes the method), possibly 
attaching info (like the id of the newly created instance) to the workorder, and 
possibly pushing an 'undo' task to the top of a list of tasks to perform if 
things fail somewhere.  
Then the whole workorder gets sent back to the origin (again, the scheduler?). This 
looks at the next uncompleted task, and starts the cycle again.  
Repeat until done. 
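
A rough sketch of that cycle in Python, under some loud assumptions: 
WorkOrder, Task, schedule() and the two example tasks are invented names, and 
the "routing" is a plain function call where a real system would cast the 
workorder to a service and have it cast back. 

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    name: str                            # e.g. "allocate_instance"
    run: Callable[["WorkOrder"], None]   # the method the service executes
    done: bool = False


@dataclass
class WorkOrder:
    tasks: list                                  # ordered Task objects
    info: dict = field(default_factory=dict)     # data attached by services
    undo: list = field(default_factory=list)     # undo task names, newest first

    def next_task(self) -> Optional[Task]:
        # First uncompleted task, or None when the workorder is finished.
        return next((t for t in self.tasks if not t.done), None)


def schedule(order: WorkOrder) -> WorkOrder:
    # Stand-in for the scheduler: pick the first uncompleted task, hand the
    # whole workorder to the "service", get it back, and repeat until done.
    while (task := order.next_task()) is not None:
        task.run(order)
        task.done = True
    return order


def allocate(order):
    order.info["instance_id"] = "inst-1"           # attach info
    order.undo.insert(0, "deallocate_instance")    # push undo to the top


def assign_ip(order):
    order.info["ip"] = "10.0.0.5"
    order.undo.insert(0, "release_ip")


order = schedule(WorkOrder(tasks=[Task("allocate_instance", allocate),
                                  Task("assign_ip", assign_ip)]))
```

After the run, order.info carries what each step attached, and order.undo 
lists the undo tasks with the most recent one first. 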

If there is a failure, the scheduler works through the 'undo' list on the 
workorder, and then makes whatever decisions are needed to redo the tasks 
elsewhere.  The workorder contains the record of the failed attempt, so it 
doesn't, for example, try to send the server build back to the same hosts that 
just failed. 
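
The failure path could look something like the following Python sketch. It 
assumes the workorder is a plain dict with an 'undo' list of callables and a 
'failed_hosts' list; run_workorder(), build_on() and BuildFailed are 
hypothetical names, not anything in nova today. 

```python
class BuildFailed(Exception):
    pass


def run_workorder(order, hosts, build_on):
    """Try hosts in turn; after a failure, unwind the undo list and retry."""
    while True:
        # Skip every host the workorder already records as failed.
        candidates = [h for h in hosts if h not in order["failed_hosts"]]
        if not candidates:
            raise BuildFailed("no hosts left to try")
        host = candidates[0]
        try:
            build_on(host, order)   # may push undo callables onto the order
            return host
        except BuildFailed:
            while order["undo"]:
                order["undo"].pop(0)(order)     # newest undo runs first
            order["failed_hosts"].append(host)  # record the failed attempt


log = []


def build_on(host, order):
    # Register the undo step first, then fail on one particular host.
    order["undo"].insert(0, lambda o: log.append("cleanup " + host))
    if host == "host-a":
        raise BuildFailed(host)


order = {"undo": [], "failed_hosts": []}
chosen = run_workorder(order, ["host-a", "host-b"], build_on)
```

Here host-a fails, its undo step runs, and the retry goes to host-b because 
the workorder remembers host-a as a failed attempt. 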

The workorder acts as an environment for the tasks, and could be passed to 
tasks (rpc methods) as an attribute of the context object. 
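
As a tiny illustration of passing the workorder on the context, here is a 
Python sketch. RequestContext below is a toy stand-in, not nova's actual 
context class, and attach_volume() is a made-up rpc-style method. 

```python
class RequestContext:
    """Toy context object carrying the workorder as an attribute."""

    def __init__(self, user_id, workorder=None):
        self.user_id = user_id
        self.workorder = workorder


def attach_volume(context, volume_id):
    # The task reads and updates its environment through the context.
    context.workorder.setdefault("attached", []).append(volume_id)
    return context.workorder


ctx = RequestContext("demo", workorder={"instance_id": "inst-1"})
attach_volume(ctx, "vol-9")
```

The method signature stays the same as any other rpc method; the workorder 
just rides along with the context. 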


Anyway, that is my notion.  Flame away. 



--
        Monsyne M. Dragon
        OpenStack/Nova 
        cell 210-441-0965
        work x 5014190

This email may include confidential information. If you received it in error, 
please delete it.


