On Wed, Aug 31, 2016 at 1:33 AM, Joshua Harlow <[email protected]> wrote:
>>
>> Enabling this option will make it so Nova scheduler loads instance
>> info asynchronously at start up. Depending on the number of
>> hypervisors and instances, it can take several minutes. (we are
>> talking about 10-15 minutes with 600+ Ironic nodes, or ~1s per node in
>> our case)
>
> This feels like a classic thing that could just be made better by a
> scatter/gather (in threads or other?) to the database or other service.
> 1s per node seems ummm, sorta bad and/or non-optimal (I wonder if this
> is low hanging fruit to improve this). I can travel around the world
> 7.5 times in that amount of time (if I was a light beam, haha).
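To illustrate the scatter/gather idea Joshua mentions, here is a minimal sketch (function and node names are hypothetical, not Nova code) that fans the per-node lookups out over a thread pool instead of doing them serially:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_instance_info(node):
    # Placeholder for the per-node lookup that costs ~1s each when done
    # serially (e.g. a database query or API call per hypervisor).
    return {"node": node, "instances": []}

def gather_all(nodes, max_workers=50):
    # Scatter the lookups across a thread pool and gather the results in
    # order; 600 nodes at ~1s each go from ~10 minutes to a few seconds
    # when the per-node work is I/O bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_instance_info, nodes))

results = gather_all(["node-%d" % i for i in range(600)])
```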
This behavior was only triggered under the following conditions:
- Nova Kilo
- scheduler_tracks_instance_changes=False

So someone installing the latest Nova version won't have this issue.
Furthermore, if you enable scheduler_tracks_instance_changes, instances
will be loaded asynchronously in chunks when nova-scheduler starts
(10 compute nodes at a time). But Jim found that enabling this config
causes OOM errors.

So I investigated and found a very interesting bug present if you run
Nova in the Ironic context, or anything where a single nova-compute
process manages multiple (or a LOT of) hypervisors.

As explained previously, Nova loads the list of instances per compute
node to help with placement decisions:
https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L590

Again, in the Ironic context, a single nova-compute host manages ALL
instances. This means this specific line found in _add_instance_info
will load ALL instances managed by that single nova-compute host.
What's even funnier is that _add_instance_info is called from
get_all_host_states for every compute node (hypervisor), NOT per
nova-compute host. This means if you have 2000 hypervisors (Ironic
nodes), this function will load 2000 instances per hypervisor found in
get_all_host_states, ending with an overall process loading 2000^2 rows
from the database. Now I know why Jim Roll complained about OOM errors.

objects.InstanceList.get_by_host_and_node should be used instead, NOT
objects.InstanceList.get_by_host. Will report this bug soon.

>> There is a lot of side-effects to using it though. For example:
>> - you can only run ONE nova-scheduler process since cache state won't
>> be shared between processes and you don't want instances to be
>> scheduled twice to the same node/hypervisor.
>
> Out of curiosity, do you have only one scheduler process active and
> passive scheduler process(es) idle waiting to become active if the
> other scheduler dies?
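A toy model of the blow-up described above (a simplified sketch that only mirrors the names of the Nova calls, not Nova's actual code): get_by_host returns every instance on the nova-compute host, while get_by_host_and_node filters down to one hypervisor, so calling the former once per compute node loads N^2 rows instead of N:

```python
# One nova-compute "host" (compute1) managing 2000 Ironic nodes, each
# with one instance. This is an illustration, not Nova code.
instances = [{"host": "compute1", "node": "node-%d" % i} for i in range(2000)]

def get_by_host(host):
    # What the buggy code path does: ALL instances on the compute host.
    return [i for i in instances if i["host"] == host]

def get_by_host_and_node(host, node):
    # What should be used: only the instances on that one hypervisor.
    return [i for i in instances if i["host"] == host and i["node"] == node]

# get_all_host_states iterates over every compute node (hypervisor):
nodes = ["node-%d" % i for i in range(2000)]
rows_buggy = sum(len(get_by_host("compute1")) for n in nodes)
rows_fixed = sum(len(get_by_host_and_node("compute1", n)) for n in nodes)
# rows_buggy is 2000^2 = 4,000,000; rows_fixed is 2000.
```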
> (pretty simply done via something like
> https://kazoo.readthedocs.io/en/latest/api/recipe/election.html)
> Or do you have some manual/other process that kicks off a new
> scheduler if the 'main' one dies?

We use the HA feature of our virtualization infrastructure to handle
failover. This is a compromise we are willing to accept for now. I
agree that not everybody has access to this kind of feature in their
infra.

>> 2) Run a single nova-compute service
>>
>> I strongly suggest you DO NOT run multiple nova-compute services. If
>> you do, you will have duplicated hypervisors loaded by the scheduler
>> and you could end up with conflicting scheduling. You will also have
>> twice as many hypervisors to load in the scheduler.
>
> This seems scary (whenever I hear run a single of anything in a
> *cloud* platform, that makes me shiver). It'd be nice if we at least
> recommended people run
> https://kazoo.readthedocs.io/en/latest/api/recipe/election.html
> or have some active/passive automatic election process to handle that
> single thing dying (which they usually do, at odd times of the night).
> Honestly I'd (personally) really like to get to the bottom of how we
> as a group of developers ever got to the place where software was
> released (and/or even recommended to be used) in a *cloud* platform
> that ever required only one of anything to be ran (that's crazy
> bonkers, and yes there is history here, but damn, it just feels rotten
> as all hell, for lack of better words).

Same as above. If the nova-compute process stops, customers won't lose
access to their baremetal but won't be able to manage it (create,
start, stop). In our context, that's not something they do often. In
fact, we more often than not deliver the baremetal for them in their
projects/tenants and they pretty much never touch the API anyway. Also,
there is this hash ring feature coming in the latest Nova version.
Meanwhile, we are happy with the compromise.
>> 3) Increase service_down_time
>>
>> If you have a lot of nodes, you might have to increase this value,
>> which is set to 60 seconds by default. This value is used by the
>> ComputeFilter filter to exclude nodes it hasn't heard from. If it
>> takes more than 60 seconds to list the nodes, you might guess what
>> will happen: the scheduler will reject all of them since node info is
>> already outdated when it finally hits the filtering steps. I strongly
>> suggest you tweak this setting, regardless of the use of
>> CachingScheduler.
>
> Same kind of feeling I had above also applies, something feels broken
> if such things have to be found by operators (I'm pretty sure yahoo
> when I was there saw something similar) and not by the developers
> making the software. If I could (and I know I really can't due to the
> community we work in) I'd very much have an equivalent of a
> retrospective around how these kinds of solutions got built and how
> they ended up getting released to the wider public with such flaws....

The bug got fixed by Jim Roll as pointed out earlier. So I think this
particular recommendation might not apply if you are using the latest
Nova version. Bugs happen ¯\_(ツ)_/¯ and it just happens that someone
caught this one when using Ironic in Liberty. We would have caught it
too if we had paid more attention to performance, done scaling tests
and profiled the code a bit more before complaining publicly.

But the other bug I found and mentioned above still exists.
Fortunately, it won't show in the Ironic context anymore since Jim made
it so the Ironic host manager never loads the list of instances per
node; it's something we don't care about with baremetal. But if you are
running Kilo, you are out of luck and will be hitting all this madness.
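For reference, the service_down_time knob quoted above lives in the [DEFAULT] section of nova.conf. The value below is only an example, not a recommendation; size it to however long your node listing actually takes:

```ini
[DEFAULT]
# Default is 60 seconds; raise it so ComputeFilter does not reject
# nodes whose status was gathered before a slow (multi-minute) node
# listing finished.
service_down_time = 600
```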
>> [1] https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L589-L592
>> [2] https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L65-L68
>> [3] http://docs.openstack.org/developer/ironic/deploy/install-guide.html#configure-compute-to-use-the-bare-metal-service
>> [4] https://github.com/openstack/nova/blob/282c257aff6b53a1b6bb4b4b034a670c450d19d8/nova/conf/scheduler.py#L166-L185
>> [5] https://bugs.launchpad.net/nova/+bug/1479124
>> [6] https://www.youtube.com/watch?v=BcHyiOdme2s
>> [7] https://gist.github.com/mgagne/1fbeca4c0b60af73f019bc2e21eb4a80

--
Mathieu

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
