Hi Folks,

We're very occasionally seeing problems where a thread processing a create 
hangs (we've seen it happen when talking to Cinder and Glance).  Whilst those 
issues need to be hunted down in their own right, they do show up what seems to 
me to be a weakness in the processing of delete requests that I'd like to get 
some feedback on.

Delete is the one operation that is allowed regardless of the instance state 
(since it's a one-way operation, and users should always be able to free up 
their quota).   However when a create thread is hung in one of these states, 
the delete requests also block when they hit the manager, because they are 
synchronized on the instance uuid.   Because the user making the delete request 
doesn't see anything happen they tend to submit more delete requests.   The 
service is still up, so these go to the compute manager as well, and 
eventually all of the worker threads are waiting for the lock and the compute 
manager stops consuming new messages.
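
To make the failure mode concrete, here's a toy illustration (plain Python 
threads and made-up names, not the real manager code) of how a single hung 
create soaks up every worker that then tries to delete the same instance:

    import threading
    import time
    from collections import defaultdict

    # Stand-in for the per-instance synchronisation in the compute manager.
    _instance_locks = defaultdict(threading.Lock)

    def create_instance(uuid):
        with _instance_locks[uuid]:
            time.sleep(3600)        # stands in for a call to Cinder/Glance that never returns

    def delete_instance(uuid):
        with _instance_locks[uuid]: # every delete queues here behind the hung create
            print("deleted %s" % uuid)

    threading.Thread(target=create_instance, args=("abc",), daemon=True).start()
    time.sleep(0.1)
    # Each retry from the user ties up another worker; none of them make progress.
    for _ in range(3):
        threading.Thread(target=delete_instance, args=("abc",), daemon=True).start()
    time.sleep(1)
    print("all delete workers are still blocked on the uuid lock")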

The problem isn't limited to deletes - although in most cases the state checks 
in the API mean that you have to keep making different calls to hit the same 
problem with an instance stuck in another state.   Users also seem to be more 
impatient with deletes, as they are trying to free up quota for other things.

So while I know that we should never get a thread into a hung state in the 
first place, I was wondering about one of the following approaches to address 
just the delete case (rough sketches of each follow after the list):

i) Change the delete call on the manager so it doesn't wait for the uuid lock.  
Deletes should be coded so that they work regardless of the state of the VM, 
and other actions should be able to cope with a delete being performed from 
under them.  There is of course no guarantee that the delete itself won't block 
as well. 

ii) Record in the API server that a delete has been started (maybe setting the 
task state to DELETING in the API is enough, if we're sure this doesn't get 
cleared), and add a periodic task in the compute manager to check for and 
delete instances that have been in the DELETING state for more than some 
timeout.  Then the API, knowing that the delete will be processed eventually, 
can just no-op any further delete requests.

iii) Add some hook into the ServiceGroup API so that the liveness timer depends 
on getting a free thread from the compute manager pool (i.e. run some no-op 
task), so that if there are no free threads the service is reported as down. 
That would (eventually) stop the scheduler from sending new requests to it, and 
let deletes be handled in the API server, but of course it won't help with 
commands for other instances on the same host.

iv) Move away from having a single general topic and thread pool for all 
requests, and start a listener on an instance-specific topic for each running 
instance on a host (leaving the general topic and pool just for creates and 
other non-instance calls like the hypervisor API).   Then a blocked task would 
only affect requests for that specific instance.
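
For (i), the change is roughly the sketch below: delete tries the per-uuid 
lock but never waits for it, so the actual teardown has to be safe to run 
while someone else still thinks they hold the instance (all names here are 
made up for illustration):

    import threading
    from collections import defaultdict

    _instance_locks = defaultdict(threading.Lock)   # as in the first snippet

    def tear_down(uuid):
        print("deleted %s" % uuid)  # hypothetical stand-in for the real cleanup

    def delete_instance_unsynchronized(uuid):
        lock = _instance_locks[uuid]
        got_it = lock.acquire(blocking=False)   # don't queue behind a hung create
        try:
            tear_down(uuid)         # could of course still block itself
        finally:
            if got_it:
                lock.release()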
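
For (ii), the compute-side periodic task might look something like this; the 
db helper, field names and timeout are assumptions for illustration rather 
than existing interfaces.  The API side is then trivial: if the task state is 
already DELETING, just return success without casting to the compute manager 
again.

    import datetime

    DELETE_TIMEOUT = datetime.timedelta(minutes=10)   # would really be a config option

    def reap_stuck_deletes(context, db, local_delete):
        # Runs from the compute manager's periodic tasks, outside the uuid lock.
        now = datetime.datetime.utcnow()
        for instance in db.instances_by_task_state(context, 'deleting'):
            if now - instance['updated_at'] > DELETE_TIMEOUT:
                # Don't wait for the blocked worker to ever get the lock;
                # force the delete through locally.
                local_delete(context, instance)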
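
For (iii), the idea is that the service only reports itself as alive if it can 
actually get a worker scheduled; a rough sketch using eventlet (the pool and 
heartbeat plumbing here are assumptions, not the current ServiceGroup 
interface):

    import eventlet

    def _pool_is_responsive(pool, timeout=5):
        # Prove a worker is free by pushing a no-op through the pool with a deadline.
        try:
            with eventlet.Timeout(timeout):
                pool.spawn(lambda: None).wait()
            return True
        except eventlet.Timeout:
            return False

    def maybe_heartbeat(pool, heartbeat):
        # Skip the liveness update if no worker could be scheduled, so the
        # service ages out as 'down' and the scheduler stops picking it.
        if _pool_is_responsive(pool):
            heartbeat()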
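
And for (iv), the shape of the change is one consumer per instance, so a hang 
only backs up that instance's own queue; a toy dispatcher (ordinary threads 
and queues rather than the rpc layer, names illustrative):

    import queue
    import threading

    class PerInstanceDispatcher(object):
        def __init__(self):
            self._queues = {}

        def _worker(self, q):
            while True:
                fn, args = q.get()
                fn(*args)           # a hang here only stalls this one instance

        def start_instance_listener(self, uuid):
            q = queue.Queue()
            self._queues[uuid] = q
            threading.Thread(target=self._worker, args=(q,), daemon=True).start()

        def cast(self, uuid, fn, *args):
            # Requests for other instances keep flowing even if one worker hangs.
            self._queues[uuid].put((fn, args))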

I'm tending towards ii) as a simple and pragmatic solution in the near term, 
although I like both iii) and iv) as generally good enhancements - but iv) in 
particular feels like a pretty seismic change.

Thoughts please,

Phil        
