On 26/02/2013, at 2:15 PM, Chris Behrens <cbehr...@codestud.com> wrote:
>
> On Feb 25, 2013, at 6:39 PM, Joe Gordon <j...@cloudscaling.com> wrote:
>
>>
>> It looks like the scheduler issues are related to the rabbitmq issues.
>> "host 'qh2-rcc77' ... is disabled or has not been heard from in a while"
>>
>> What does 'nova host-list' say? the clocks must all be synced up?
>
> Good things to check. It feels like something is spinning way too much
> within this filter, though. This can also cause the above message. The
> scheduler pulls all of the records before it starts filtering… and if there's
> a huge delay somewhere, it can start seeing a bunch of hosts as disabled.
>
> The filter doesn't look like a problem.. unless there's a large amount of
> aggregate metadata… and/or a large amount of key/values for the
> instance_type's extra specs. There *is* a DB call in the filter. If that's
> blocking for an extended period of time, the whole process is blocked… But I
> suspect by the '100% cpu' comment, that this is not the case… So the only
> thing I can think of is that it returns a tremendous amount of metadata.
>
> Adding some extra logging in the filter could be useful.
>
> - Chris

Thanks Chris,

I have 2 aggregates and 2 keys defined, and each of the 80 hosts has either
one or the other. At the moment every flavour has either one or the other
too, so I don't think it's too much data.

I've tracked it down to this call:

    metadata = db.aggregate_metadata_get_by_host(context, host_state.host)

It's taking forever to complete. Just having a look into that code to see
why. There is a nested for loop in there, so my guess is it's something to do
with that, although there is hardly any data in our aggregates tables, so I
can't see it taking that long.

Cheers,
Sam
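
P.S. In case it helps anyone hitting the same thing, below is a minimal sketch of the kind of extra timing log Chris suggested, wrapped around a slow call. It is not nova code: `log_duration`, `fake_metadata_lookup` and `timed_lookup` are hypothetical names, the lookup is a stand-in for the real `db.aggregate_metadata_get_by_host` call, and the 1 second warning threshold is an arbitrary assumption. It also assumes Python 3 for `time.monotonic`.

```python
import logging
import time
from contextlib import contextmanager

LOG = logging.getLogger(__name__)


@contextmanager
def log_duration(label, warn_after=1.0):
    """Log how long the wrapped block took.

    Logs at WARNING if the block ran longer than warn_after seconds
    (an arbitrary threshold chosen for this sketch), otherwise at DEBUG.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed > warn_after else logging.DEBUG
        LOG.log(level, "%s took %.3fs", label, elapsed)


def fake_metadata_lookup(host):
    # Hypothetical stand-in for the real DB call discussed above;
    # it sleeps briefly and returns made-up metadata.
    time.sleep(0.01)
    return {"example_key": {"example_value"}}


def timed_lookup(host):
    # Wrap the (stand-in) lookup so every call is timed and logged per host.
    with log_duration("aggregate metadata lookup for %s" % host):
        return fake_metadata_lookup(host)
```

Calling `timed_lookup(...)` once per host in the filter would show whether every call is slow or only some of them, which should help narrow down whether the nested loop or the query itself is the problem.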