On 25 October 2013 23:23, Chris Behrens <cbehr...@codestud.com> wrote:
> On Oct 25, 2013, at 3:46 AM, "Day, Phil" <philip....@hp.com> wrote:
>> Hi Folks,
>>
>> We're very occasionally seeing problems where a thread processing a create
>> hangs (and we've seen this when talking to Cinder and Glance).  Whilst those
>> issues need to be hunted down in their own right, they expose what seems to
>> me to be a weakness in the processing of delete requests that I'd like to
>> get some feedback on.
>>
>> Delete is the one operation that is allowed regardless of the instance state
>> (since it's a one-way operation, and users should always be able to free up
>> their quota).  However, when a create thread hangs in one of these states,
>> delete requests also block when they hit the manager, because they are
>> synchronized on the instance uuid.  Because the user making the delete
>> request doesn't see anything happen, they tend to submit more delete
>> requests.  The service is still up, so these go to the compute manager as
>> well, and eventually all of the threads are waiting for the lock and the
>> compute manager stops consuming new messages.
>>
>> The problem isn't limited to deletes, although for other operations the
>> state change made in the API means you have to keep making different calls
>> to get past the state-checking logic when an instance is stuck in some
>> other state.  Users also seem to be more impatient with deletes, as they
>> are trying to free up quota for other things.
>>
>> So while I know that we should never get a thread into a hung state in the
>> first place, I was wondering about one of the following approaches to
>> address just the delete case:
>>
>> i) Change the delete call on the manager so it doesn't wait for the uuid
>> lock.  Deletes should be coded so that they work regardless of the state of
>> the VM, and other actions should be able to cope with a delete being
>> performed out from under them.  There is of course no guarantee that the
>> delete itself won't block as well.
>>
>
> Agree.  I've argued for a long time that our code should be able to handle 
> the instance disappearing.  We do have a number of places where we catch 
> InstanceNotFound to handle this already.

+1, we need to get better at that.
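
To make that concrete, here is a minimal sketch of the shape I think
(i) implies -- not the actual nova code; _do_terminate_instance is a
made-up helper, though exception.InstanceNotFound is real:

    from nova import exception

    # Hedged sketch of (i); _do_terminate_instance is hypothetical.
    def terminate_instance(self, context, instance):
        # Deliberately NOT synchronized on instance['uuid']: a hung
        # create thread holding that lock must not block the delete.
        try:
            self._do_terminate_instance(context, instance)
        except exception.InstanceNotFound:
            # Another delete (or the create unwinding) already removed
            # the instance; delete is idempotent, so treat it as done.
            pass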

>> ii) Record in the API server that a delete has been started (it may be
>> enough to use the task state being set to DELETING in the API, if we're
>> sure this doesn't get cleared), and add a periodic task in the compute
>> manager to check for and delete instances that have been in the "DELETING"
>> state for more than some timeout.  Then the API, knowing that the delete
>> will eventually be processed, can just no-op any further delete requests.
>
> We already set the task state to DELETING in the API (unless I'm mistaken --
> but I looked at this recently).  However, instead of dropping duplicate
> deletes, I say they should still be sent/handled.  Any delete code should be
> able to handle another delete occurring at the same time, IMO…  much like
> how you say other methods should be able to handle an instance disappearing
> from underneath them.  If a compute goes down while 'deleting', a 2nd delete
> later should still be able to function locally.  Same thing if the message
> to compute happens to be lost.

+1 The periodic sync task, if the compute comes back after crashing
having lost the delete message, should help spot the inconsistency
and possibly resolve it. We probably need to turn those inconsistency
log messages into notifications so it's a bit easier to find them.
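
Something like the rough sketch below is what I imagine for the
periodic check in (ii); _get_instances_on_host and _delete_instance
are hypothetical helpers, and the wiring as a periodic task is
assumed rather than shown:

    import datetime

    STUCK_DELETE_TIMEOUT = datetime.timedelta(seconds=600)

    # Assumed to be registered as a periodic task (e.g. via nova's
    # periodic_task decorator); shown as a plain method here.
    def _reap_stuck_deletes(self, context):
        now = datetime.datetime.utcnow()
        for instance in self._get_instances_on_host(context):
            if (instance['task_state'] == 'deleting' and
                    now - instance['updated_at'] > STUCK_DELETE_TIMEOUT):
                # Relies on deletes being idempotent per (i): safe even
                # if a duplicate user request lands at the same time.
                self._delete_instance(context, instance)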

>> iii) Add some hook into the ServiceGroup API so that the timer depends on
>> getting a free thread from the compute manager pool (i.e. run some no-op
>> task), so that if there are no free threads the service is reported as
>> down.  That would (eventually) stop the scheduler from sending new requests
>> to it, and let deletes be processed in the API server, but it won't of
>> course help with commands for other instances on the same host.
>
> This seems kinda hacky to me.

I hope we don't need this.
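
For completeness though, (iii) would look something like this hedged
sketch: gate the liveness report on a no-op completing in the RPC
thread pool. Everything except eventlet.Timeout is an illustrative
name:

    import eventlet

    def _report_state(self):
        try:
            with eventlet.Timeout(5):
                # spawn() blocks when the pool is full, so it sits
                # inside the timeout as well.
                self._rpc_pool.spawn(lambda: None).wait()  # no-op task
        except eventlet.Timeout:
            # Pool exhausted by hung threads: skip this heartbeat so
            # the ServiceGroup API eventually reports the service down.
            return
        self._servicegroup_api.report_alive(self._service_ref)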

>>
>> iv) Move away from having a general topic and thread pool for all requests,
>> and start a listener on an instance-specific topic for each running instance
>> on a host (leaving the general topic and pool just for creates and other
>> non-instance calls like the hypervisor API).  Then a blocked task would
>> only affect requests for that specific instance.
>>
>
> I don't like this one when thinking about scale.  1 million instances == 1
> million more queues.

+1
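
Just to illustrate why: (iv) amounts to something like the sketch
below, where every running instance gets its own consumer
(rpc_server_for_topic is a made-up helper, not a real oslo/nova API):

    # Purely illustrative sketch of (iv).
    def start_instance_consumers(host, instance_uuids):
        servers = {}
        for uuid_ in instance_uuids:
            # One topic (so one queue/consumer) per instance: a hung
            # handler only blocks that instance's own queue...
            topic = 'compute.%s.%s' % (host, uuid_)
            servers[uuid_] = rpc_server_for_topic(topic)  # hypothetical
            servers[uuid_].start()
        # ...but a cloud with a million instances now runs a million
        # consumers, which is the scale problem above.
        return servers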

>> I'm tending towards ii) as a simple and pragmatic solution in the near term,
>> although I like both iii) and iv) as generally good enhancements - but iv)
>> in particular feels like a pretty seismic change.
> I vote for both i) and ii) at minimum.
+1


I also have another idea that would let us better track the user's
intent, idea (v):

* change the API to be more task-based (see the summit session)

* We would then know what API requests the user has made, and in
roughly what order

* If the user has already called delete, we can reject any new API
requests as conflicting operations, unless the user cancels the
delete (assuming soft delete is turned on, etc, etc.)

* But perhaps extra delete requests could just be given back the
task_uuid of the existing delete request (see the sketch below)

* the periodic sync can check for pending tasks, so we know what the
user's intent was
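
To show what the duplicate-delete behaviour in (v) could look like, a
loose sketch -- TaskStore and Conflict are made-up names, not a real
nova API:

    import uuid


    class Conflict(Exception):
        """Would map to HTTP 409 in the API layer."""


    class TaskStore(object):
        def __init__(self):
            self._pending = {}  # instance_uuid -> (action, task_uuid)

        def start(self, instance_uuid, action):
            pending = self._pending.get(instance_uuid)
            if pending:
                prior_action, task_uuid = pending
                if action == 'delete' and prior_action == 'delete':
                    # Duplicate delete: hand back the existing task
                    # instead of queuing more work.
                    return task_uuid
                raise Conflict('%s already pending' % prior_action)
            task_uuid = str(uuid.uuid4())
            self._pending[instance_uuid] = (action, task_uuid)
            return task_uuid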

John
