Ugh, hit send accidentally. Please take my comments lightly, as I have not been very involved with these developments; I'm just chiming in as an operator with some ideas.
On Fri, Nov 2, 2018 at 9:32 PM Matt Riedemann <mriede...@gmail.com> wrote:
>
> On 11/2/2018 2:22 PM, Eric Fried wrote:
> > Based on a (long) discussion yesterday [1] I have put up a patch [2]
> > whereby you can set [compute]resource_provider_association_refresh to
> > zero and the resource tracker will never* refresh the report client's
> > provider cache. Philosophically, we're removing the "healing" aspect of
> > the resource tracker's periodic and trusting that placement won't
> > diverge from whatever's in our cache. (If it does, it's because the op
> > hit the CLI, in which case they should SIGHUP - see below.)
> >
> > *except:
> > - When we initially create the compute node record and bootstrap its
> > resource provider.
> > - When the virt driver's update_provider_tree makes a change,
> > update_from_provider_tree reflects them in the cache as well as pushing
> > them back to placement.
> > - If update_from_provider_tree fails, the cache is cleared and gets
> > rebuilt on the next periodic.
> > - If you send SIGHUP to the compute process, the cache is cleared.
> >
> > This should dramatically reduce the number of calls to placement from
> > the compute service. Like, to nearly zero, unless something is actually
> > changing.
> >
> > Can I get some initial feedback as to whether this is worth polishing up
> > into something real? (It will probably need a bp/spec if so.)
> >
> > [1]
> > http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
> > [2] https://review.openstack.org/#/c/614886/
> >
> > ==========
> > Background
> > ==========
> > In the Queens release, our friends at CERN noticed a serious spike in
> > the number of requests to placement from compute nodes, even in a
> > stable-state cloud. Given that we were in the process of adding a ton of
> > infrastructure to support sharing and nested providers, this was not
> > unexpected.
> > Roughly, what was previously:
> >
> > @periodic_task:
> >     GET /resource_providers/$compute_uuid
> >     GET /resource_providers/$compute_uuid/inventories
> >
> > became more like:
> >
> > @periodic_task:
> >     # In Queens/Rocky, this would still just return the compute RP
> >     GET /resource_providers?in_tree=$compute_uuid
> >     # In Queens/Rocky, this would return nothing
> >     GET /resource_providers?member_of=...&required=MISC_SHARES...
> >     for each provider returned above:  # i.e. just one in Q/R
> >         GET /resource_providers/$compute_uuid/inventories
> >         GET /resource_providers/$compute_uuid/traits
> >         GET /resource_providers/$compute_uuid/aggregates
> >
> > In a cloud the size of CERN's, the load wasn't acceptable. But at the
> > time, CERN worked around the problem by disabling refreshing entirely.
> > (The fact that this seems to have worked for them is an encouraging sign
> > for the proposed code change.)
> >
> > We're not actually making use of most of that information, but it sets
> > the stage for things that we're working on in Stein and beyond, like
> > multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
> > etc., so removing/reducing the amount of information we look at isn't
> > really an option strategically.
>
> A few random points from the long discussion that should probably be
> re-posed here for wider thought:
>
> * There was probably a lot of discussion about why we needed to do this
> caching and stuff in the compute in the first place. What has changed
> that we no longer need to aggressively refresh the cache on every
> periodic? I thought initially it came up because people really wanted
> the compute to be fully self-healing to any external changes, including
> hot plugging resources like disk on the host to automatically reflect
> those changes in inventory.
> Similarly, external user/service
> interactions with the placement API which would then be automatically
> picked up by the next periodic run - is that no longer a desire, and/or
> how was the decision made previously that simply requiring a SIGHUP in
> that case wasn't sufficient/desirable?

I think that would be nice to have. However, at the moment, from an
operator's perspective, it looks like the placement service can get out
of sync pretty easily, so I think it'd be good to commit to either
really making it self-heal (delete stale allocations, create ones that
should be there) or removing all the self-healing stuff.

Also, for the self-healing work, if we take that route and implement it
fully, it might make the placement split much easier, because we just
switch over and wait for the computes to automagically populate
everything. But that's the type of operation that happens once in the
lifetime of a cloud.

Just for information's sake, a clean-state cloud which had no reported
issues over a period of maybe 2-3 months already has 4 allocations which
are incorrect and 12 allocations pointing to the wrong resource
provider, so I think this comes down to committing to either
"self-healing" to fix those issues or not.

> * I believe I made the point yesterday that we should probably not
> refresh by default, and let operators opt-in to that behavior if they
> really need it, i.e. they are frequently making changes to the
> environment, potentially by some external service (I could think of
> vCenter doing this to reflect changes from vCenter back into
> nova/placement), but I don't think that should be the assumed behavior
> by most and our defaults should reflect the "normal" use case.

I agree. For 99% of the deployments out there, the placement service
will likely not be touched by anyone except the services and, at this
stage, probably just Nova talking to placement directly.
I really do agree with the statement about the "normal" use case: a user
playing around with placement out-of-band is not common at all.

> * I think I've noted a few times now that we don't actually use the
> provider aggregates information (yet) in the compute service. Nova host
> aggregate membership is mirrored to placement since Rocky [1] but that
> happens in the API, not the compute. The only thing I can think of
> that relied on resource provider aggregate information in the compute is
> the shared storage providers concept, but that's not supported (yet)
> [2]. So do we need to keep retrieving aggregate information when nothing
> in compute uses it yet?

Is there anything stopping us here from polling that information during
the time when the VM is spawning? It doesn't seem like something that
the compute node always needs to check.

> * Similarly, why do we need to get traits on each periodic? The only
> in-tree virt driver I'm aware of that *reports* traits is the libvirt
> driver for CPU features [3]. Otherwise I think the idea behind getting
> the latest traits is so the virt driver doesn't overwrite any traits set
> externally on the compute node root resource provider. I think that
> still stands and is probably OK, even though we have generations now
> which should keep us from overwriting if we don't have the latest
> traits, but I wanted to bring it up since it's related to the "why do we
> need provider aggregates in the compute?" question.

Forgive my ignorance on this subject, but wouldn't traits really only be
set when the service is first started (so that check can happen only
once on startup), with the compute nodes never really consuming that
information (but the scheduler does)? Also, I doubt virt drivers
actually report much change in traits (CPU flags changing at runtime?).
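
To make the "refresh interval of 0 means never refresh" idea from Eric's description a bit more concrete, here is a minimal sketch of how I picture it. This is an assumption-laden toy, not nova's report client: ProviderCache, its fetch callable and the clock parameter are all hypothetical names made up for illustration.

```python
import time


class ProviderCache:
    """Toy per-provider cache with an optional periodic refresh.

    Hypothetical illustration only; not nova's actual report client.
    """

    def __init__(self, fetch, refresh_interval, clock=time.monotonic):
        self._fetch = fetch                # callable(uuid) -> data, e.g. a placement GET
        self._interval = refresh_interval  # seconds; 0 disables periodic refresh
        self._clock = clock                # injectable so the behaviour can be tested
        self._data = {}                    # uuid -> cached data
        self._fetched_at = {}              # uuid -> time of the last fetch

    def _is_stale(self, uuid):
        if uuid not in self._data:
            return True                    # never fetched: always bootstrap once
        if self._interval == 0:
            return False                   # refresh disabled: trust the cache
        return self._clock() - self._fetched_at[uuid] > self._interval

    def get(self, uuid):
        """Return cached data, calling fetch only when the entry is stale."""
        if self._is_stale(uuid):
            self._data[uuid] = self._fetch(uuid)
            self._fetched_at[uuid] = self._clock()
        return self._data[uuid]

    def clear(self):
        """Drop everything, e.g. on SIGHUP or after a failed update."""
        self._data.clear()
        self._fetched_at.clear()
```

With refresh_interval=0 the fetch callable only runs on first use and after clear(), which is roughly the "nearly zero calls unless something changes" behaviour described above. A SIGHUP handler would then just call clear(); how the real patch wires that up is in the review, not in this sketch.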

> * Regardless of what we do, I think we should probably *at least* make
> that refresh associations config allow 0 to disable it so CERN (and
> others) can avoid the need to continually forward-port code to
> disable it.

+1

> [1]
> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/placement-mirror-host-aggregates.html
> [2] https://bugs.launchpad.net/nova/+bug/1784020
> [3]
> https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/report-cpu-features-as-traits.html
>
> --
>
> Thanks,
>
> Matt
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mna...@vexxhost.com
W. http://vexxhost.com