Excerpts from Day, Phil's message of 2013-10-25 03:46:01 -0700:
> Hi Folks,
>
> We're very occasionally seeing problems where a thread processing a create
> hangs (and we've seen this when talking to Cinder and Glance). Whilst those
> issues need to be hunted down in their own right, they do show up what
> seems to me to be a weakness in the processing of delete requests that I'd
> like to get some feedback on.
>
> Delete is the one operation that is allowed regardless of the instance
> state (since it's a one-way operation, and users should always be able to
> free up their quota). However when we get a create thread hung in one of
> these states, the delete requests will also block when they hit the
> manager, as they are synchronized on the uuid. Because the user making the
> delete request doesn't see anything happen, they tend to submit more
> delete requests. The service is still up, so these go to the compute
> manager as well, and eventually all of the threads will be waiting for the
> lock, and the compute manager will stop consuming new messages.
>
> The problem isn't limited to deletes - although in most cases the change
> of state in the API means that you have to keep making different calls to
> get past the state-checker logic to do it with an instance stuck in
> another state. Users also seem to be more impatient with deletes, as they
> are trying to free up quota for other things.
>
> So while I know that we should never get a thread into a hung state in the
> first place, I was wondering about one of the following approaches to
> address just the delete case:
>
> i) Change the delete call on the manager so it doesn't wait for the uuid
> lock. Deletes should be coded so that they work regardless of the state of
> the VM, and other actions should be able to cope with a delete being
> performed from under them. There is of course no guarantee that the delete
> itself won't block as well.
>

Almost anything unexpected that isn't "start the creation" results in just
marking an instance as ERROR, right? So this approach is actually pretty
straightforward to implement. You don't really have to make other
operations any more intelligent than they already should be in cleaning up
half-done operations when they encounter an error. It might be helpful to
suppress or de-prioritize logging of these errors when it is obvious that
this result was intended.
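To make that concrete, here's a rough sketch of a delete path that tries
the per-uuid lock but never waits on it. The lock table and the
do_delete/mark_error helpers are stand-ins for illustration, not nova's
actual synchronization plumbing:

    import threading
    from collections import defaultdict

    # Stand-in for the per-instance uuid locks the compute manager
    # synchronizes on today.
    _instance_locks = defaultdict(threading.Lock)

    def run_synchronized(instance_uuid, operation):
        # Every other operation queues here, which is how a hung create
        # stacks up waiters until the worker pool is exhausted.
        with _instance_locks[instance_uuid]:
            operation()

    def delete_instance(instance_uuid, do_delete, mark_error):
        # Option (i): try the lock, but proceed with the delete whether
        # or not we got it.  If another operation holds it (possibly
        # hung), flag the instance so that operation's error-cleanup
        # path fires when it eventually wakes up.
        lock = _instance_locks[instance_uuid]
        got_lock = lock.acquire(blocking=False)
        try:
            if not got_lock:
                mark_error(instance_uuid)
            do_delete(instance_uuid)
        finally:
            if got_lock:
                lock.release()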
> ii) Record in the API server that a delete has been started (maybe enough
> to use the task state being set to DELETING in the API if we're sure this
> doesn't get cleared), and add a periodic task in the compute manager to
> check for and delete instances that are in a "DELETING" state for more
> than some timeout. Then the API, knowing that the delete will be processed
> eventually, can just no-op any further delete requests.
>

s/API server/database/ right?

I like the coalescing approach where you no longer take up more resources
for repeated requests. I don't like the garbage collection aspect of this
plan, though. Garbage collection is a trade-off of user experience for
resources. If your GC thread gets too far behind, your resources will be
exhausted. If you make it too active, it wastes resources doing the actual
GC. Add in that you have a timeout before things can be garbage collected
and I think this becomes a very tricky thing to tune, and it may not be
obvious it needs to be tuned until you have a user who does a lot of rapid
create/delete cycles. (A rough sketch of both halves of this option is at
the end of this mail.)

> iii) Add some hook into the ServiceGroup API so that the timer could
> depend on getting a free thread from the compute manager pool (i.e. run
> some no-op task) - so that if there are no free threads then the service
> is marked down. That would (eventually) stop the scheduler from sending
> new requests to it, and make deletes be processed in the API server, but
> won't of course help with commands for other instances on the same host.
>

I'm not sure I understand this one. If the intent is that the heartbeat
should only be reported while the worker pool can actually run a no-op
task, there's a sketch of that reading at the end of this mail.

> iv) Move away from having a general topic and thread pool for all
> requests, and start a listener on an instance-specific topic for each
> running instance on a host (leaving the general topic and pool just for
> creates and other non-instance calls like the hypervisor API). Then a
> blocked task would only affect requests for a specific instance.
>

A topic per record will get out of hand rapidly (the last sketch below
shows where the cost comes from). If you think of the instance record in
the DB as the topic though, then (i) and (iv) are actually quite similar.
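As promised above, here's a rough sketch of option (ii)'s two halves:
coalescing in the API and a periodic reaper on the compute side. The field
names, the 600-second timeout, and the helper callables are all invented
for illustration:

    import time

    DELETE_TIMEOUT = 600  # seconds; tuning this is exactly the hard part

    def api_delete(instance, cast_to_compute):
        # API side: record the intent once and coalesce any repeated
        # delete requests into a no-op instead of queuing another RPC.
        if instance['task_state'] == 'deleting':
            return
        instance['task_state'] = 'deleting'
        instance['updated_at'] = time.time()
        cast_to_compute(instance['uuid'])

    def reap_stuck_deletes(list_instances, local_delete):
        # Periodic task on the compute manager: anything stuck in
        # 'deleting' past the timeout gets cleaned up out of band.
        now = time.time()
        for inst in list_instances(task_state='deleting'):
            if now - inst['updated_at'] > DELETE_TIMEOUT:
                local_delete(inst)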
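And the reading of option (iii) I referred to: a heartbeat that only ticks
if the worker pool can run a no-op within a deadline. This uses a stdlib
concurrent.futures executor for brevity; the compute manager's actual
greenthread pool would need its own equivalent:

    import concurrent.futures

    HEARTBEAT_TIMEOUT = 5  # seconds to wait for a free worker

    def pool_aware_heartbeat(pool, report_alive):
        # If every worker is parked behind a hung uuid lock, the no-op
        # never runs, the heartbeat stops, and the servicegroup layer
        # eventually marks the service down - so the scheduler stops
        # sending new builds here.
        future = pool.submit(lambda: None)
        try:
            future.result(timeout=HEARTBEAT_TIMEOUT)
        except concurrent.futures.TimeoutError:
            future.cancel()  # still queued; drop it rather than pile up
            return
        report_alive()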
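Finally, the shape of option (iv), and where the per-record cost comes
from: one queue plus one worker per instance, so a hung task only wedges
its own instance. Names are again invented, and the lock around worker
creation is omitted for brevity:

    import queue
    import threading

    class InstanceWorker(object):
        # One dedicated queue and thread per instance; a hung task only
        # blocks this instance's requests, not the whole host.
        def __init__(self):
            self.requests = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                task = self.requests.get()
                if task is None:
                    return  # shutdown sentinel
                task()

    _workers = {}  # instance uuid -> InstanceWorker

    def dispatch(instance_uuid, task):
        # The cost that "gets out of hand": a live thread and a queue
        # for every instance record on the host.
        if instance_uuid not in _workers:
            _workers[instance_uuid] = InstanceWorker()
        _workers[instance_uuid].requests.put(task)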