On 05/28/2012 01:21 PM, Clint Byrum wrote:
> Looks to me that you need to make sure the other side of that RPC
> connection is up before nova-compute. I am not familiar with the specifics
> of what Nova needs at startup, but I'd guess this is nova-api or keystone.
> Thats a pretty easy thing to do in a single system (just mess with the
> upstart jobs or init scripts) but across multiple systems, you'll need
> some kind of orchestration layer, and even then modeling the dependencies
> on the network with some other tool seems like something just begging
> to break.

In this case, it's nova-compute expecting nova-network to be up and
running when it starts up. This also causes a problem when restarting
all of the services at the same time, as seen here:

    https://bugs.launchpad.net/nova/+bug/999698

> Instead, the timeout should just be multiple minutes during startup, and
> the services should all be able to start in parallel if they are on the
> same box. I always think of one of those HP EcoPOD that is pre-installed
> with everything you need for OpenStack, and just shipped and then turned
> on. You could spend a lot of time trying to get that order just right,
> or you could just have everything extend their timeouts and get as far
> as they can without contact with the other services.
>
> nova-compute doesn't *know* that the other side is in error, it just
> knows that it is not responding. This is not a problem with nova-compute,
> so why should nova-compute fail so quickly? One could even argue that
> nova-compute should wait *forever* for the other side. From an ops
> standpoint, they're both "down", so why make the operations team take
> two actions when the actual broken service recovers?

The problem is that since nova-network isn't up, the request gets lost.
nova-compute is sitting there waiting for a response to a message that
most likely was never even received. It's also possible that
nova-network received the message but the service stopped before it
responded (but that is less likely, I think).

The message queues get created by the consumer of messages in nova. So,
in this case, nova-network creates the queue, and a request sent before
nova-network has started has no queue to be delivered to.

Some possible solutions:

1) We could adjust this code path to just loop around and try again if
it hits a timeout. We could make the timeout much shorter than the
default, to make recovery quicker. The downside would be that we're
fixing a single place, when this issue could pop up elsewhere.

2) We could make it so the sender creates the queue if it doesn't exist.
This is good because it covers all cases. The bad thing is that we would
not be able to set the queue to be auto-deleted in this case, so we
could end up with a "leak" of unwanted message queues.

I'm tempted to just write a patch that does #1 for now to address the
immediate issue and then do something better later if we come up with
something.

--
Russell Bryant
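
For illustration, here is a rough Python sketch of what option 1 (loop around and try again on a timeout) could look like. It is only a sketch under assumptions: RpcTimeout, call_with_retry, and make_flaky_call are hypothetical names made up for this example, not nova's actual RPC API, and the real change would wrap whatever call nova-compute makes to nova-network at startup.

```python
import time


class RpcTimeout(Exception):
    """Hypothetical stand-in for the error raised when no reply arrives."""


def call_with_retry(rpc_call, timeout=5.0, max_attempts=None, delay=0.0):
    """Re-send a request every time it times out.

    rpc_call:     callable that takes a timeout in seconds and either
                  returns the reply or raises RpcTimeout.
    timeout:      per-attempt timeout; kept short so recovery is quick.
    max_attempts: give up after this many tries; None retries forever.
    delay:        seconds to sleep between attempts.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return rpc_call(timeout)
        except RpcTimeout:
            # The request was probably lost because the consumer had not
            # created its queue yet, so send it again.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay)


def make_flaky_call(failures):
    """Build a fake RPC call that times out `failures` times, then replies."""
    state = {"calls": 0}

    def flaky_call(timeout):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RpcTimeout("no reply within %s seconds" % timeout)
        return "network-info"

    return flaky_call, state


if __name__ == "__main__":
    call, state = make_flaky_call(failures=2)
    reply = call_with_retry(call, timeout=0.01)
    print(reply, "after", state["calls"], "attempts")
```

With max_attempts left at None, this waits for the other side indefinitely, which matches the "wait forever" argument quoted above. The caveat from option 1 still applies: it only covers the one call site that gets wrapped.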

