On 05/06/2017 23:22, Ed Leafe wrote:
> We had a very lively discussion this morning during the Scheduler subteam meeting, which was continued in a Google hangout. The subject was how to handle claiming resources when the Resource Provider is not "simple". By "simple", I mean a compute node that provides all of the resources itself, as contrasted with a compute node that uses shared storage for disk space, or which has complex nested relationships with things such as PCI devices or NUMA nodes. The current situation is as follows:
>
> a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, etc.)
> b) scheduler passes these resource requirements to placement, which returns a list of hosts (compute nodes) that can satisfy the request.
> c) scheduler runs these through some filters and weighers to get a list ordered by best "fit"
> d) it then tries to claim the resources, by posting to placement allocations for these resources against the selected host
> e) once the allocation succeeds, scheduler returns that host to conductor to then have the VM built
>
> (some details for edge cases left out for clarity of the overall process)
>
> The problem we discussed comes into play when the compute node isn't the actual provider of the resources. The easiest example to consider is when the computes are associated with a shared storage provider. The placement query is smart enough to know that even if the compute node doesn't have enough local disk, it will get it from the shared storage, so it will return that host in step b) above. If the scheduler then chooses that host, when it tries to claim it, it will pass the resources and the compute node UUID back to placement to make the allocations. This is the point where the current code would fall short: somehow, placement needs to know to allocate the disk requested against the shared storage provider, and not the compute node.
>
> One proposal is to essentially use the same logic in placement that was used to include that host in those matching the requirements. In other words, when it tries to allocate the amount of disk, it would determine that that host is in a shared storage aggregate, and be smart enough to allocate against that provider. This was referred to in our discussion as "Plan A".
>
> Another proposal involved a change to how placement responds to the scheduler. Instead of just returning the UUIDs of the compute nodes that satisfy the required resources, it would include a whole bunch of additional information in a structured response. A straw man example of such a response is here: https://etherpad.openstack.org/p/placement-allocations-straw-man. This was referred to as "Plan B". The main feature of this approach is that part of that response would be the JSON dict for the allocation call, containing the specific resource provider UUID for each resource. This way, when the scheduler selects a host, it would simply pass that dict back to the /allocations call, and placement would be able to do the allocations directly against that information.
>
> There was another issue raised: simply providing the host UUIDs didn't give the scheduler enough information in order to run its filters and weighers. Since the scheduler uses those UUIDs to construct HostState objects, the specific missing information was never completely clarified, so I'm just including this aspect of the conversation for completeness.
> It is orthogonal to the question of how to allocate when the resource provider is not "simple".
>
> My current feeling is that we got ourselves into our existing mess of ugly, convoluted code when we tried to add these complex relationships into the resource tracker and the scheduler. We set out to create the placement engine to bring some sanity back to how we think about things we need to virtualize. I would really hate to see us make the same mistake again, by adding a good deal of complexity to handle a few non-simple cases. What I would like to avoid, no matter what the eventual solution chosen, is representing this complexity in multiple places. Currently the only two candidates for this logic are the placement engine, which knows about these relationships already, or the compute service itself, which has to handle the management of these complex virtualized resources.
>
> I don't know the answer. I'm hoping that we can have a discussion that might uncover a clear approach, or, at the very least, one that is less murky than the others.

I wasn't part of either the scheduler meeting or the hangout (hit by a French holiday), so I don't have all the details in mind and I could be making wrong assumptions; I apologize in advance if I say anything silly. That said, I still have some opinions and I'll put them here. Thanks for bringing that problem up here, Ed.

The intent of the scheduler is to pick a destination where an instance can land (the old 'can_host' concept). Getting back a list of "sharing RPs" (a shared volume, say) doesn't really help the decision-making, I feel. What a user could want is for the scheduler to pick a destination that is *close* to a given "sharing RP", or one that is *not* shared with that "sharing RP", but I don't see the need for us to return the list of "things-that-cannot-host" (i.e. "sharing RPs") when the scheduler asks Placement for the list of potential targets.

That said, in order to make scheduling decisions based on filters (like a dummy OnlyLocalDiskFilter or a NetworkSegmentedOnlyFilter) or weighers (a PreferMeLocalDisksWeigher), we could imagine the construct returned by Placement for the resource classes it handles containing more than just RP UUIDs: a list of extended dictionaries (one per Resource Provider) of "inventories minus allocations" (i.e. what's left in the cloud), keyed by resource class. Of course, the size of the result could be a problem; couldn't we imagine limited paging for that? Of course, the ordering is non-deterministic, since the construction of that list depends on what is available at the time.

The Plan A option you mention hides the complexity of the shared/non-shared logic, but at the price of making scheduling decisions on those criteria impossible unless you put filtering/weighing logic into Placement, which AFAIK we strongly disagree with.

To make the above a bit more concrete, a few rough sketches follow below (all names, UUIDs and payload shapes in them are made up).
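First, to make sure we are all picturing the same flow, here is a very rough Python-ish sketch of steps a) to e) that Ed described. Every name in it is hypothetical; this is not the actual nova scheduler code path, just an illustration of the sequence.

  # Very rough sketch of steps a) to e) above. Every name here is
  # hypothetical -- this is not the actual nova scheduler code.
  def schedule_and_claim(request_spec, placement, filters, weighers, conductor):
      # a) resource requirements extracted from the request
      resources = {'VCPU': request_spec['vcpus'],
                   'MEMORY_MB': request_spec['memory_mb'],
                   'DISK_GB': request_spec['disk_gb']}

      # b) ask placement which compute node RPs can satisfy them
      candidates = placement.get_providers(resources)

      # c) run filters and weighers to order the candidates by best "fit"
      for host_filter in filters:
          candidates = [c for c in candidates if host_filter(c, request_spec)]
      candidates.sort(key=lambda c: sum(w(c, request_spec) for w in weighers),
                      reverse=True)

      # d) try to claim by writing allocations for these resources against
      #    the selected host's resource provider
      for host in candidates:
          claimed = placement.put_allocations(
              request_spec['instance_uuid'],
              {host['uuid']: {'resources': resources}})
          if claimed:
              # e) hand the selected host to the conductor to build the VM
              return conductor.build(request_spec, host)

      raise RuntimeError('No valid host was found')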
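Second, the shared-storage problem at claim time. The UUIDs below are made up and the payload shape only approximates the allocations format, but it shows what the scheduler writes today versus what actually needs to end up recorded, and what "Plan A" would ask placement to infer on its own.

  # Made-up UUIDs; payload shape approximates the allocations format.
  COMPUTE_RP = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'      # compute node RP
  SHARED_DISK_RP = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'  # shared storage RP,
                                                           # same aggregate

  # What the scheduler claims today: everything against the compute node,
  # which is wrong when the disk really comes from the shared storage pool.
  naive_allocation = {
      COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048, 'DISK_GB': 20}},
  }

  # What needs to be recorded: DISK_GB against the sharing provider.
  # Under "Plan A", placement itself would notice the aggregate
  # relationship and rewrite the naive request into this at allocation time.
  correct_allocation = {
      COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
      SHARED_DISK_RP: {'resources': {'DISK_GB': 20}},
  }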
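Third, "Plan B". The etherpad has the actual straw man; the snippet below is only my guess at the *shape* of such a structured response, to keep the discussion concrete. The key names and numbers are invented, not copied from the etherpad.

  # Invented example of a "Plan B"-style response: the allocation dict is
  # already split per resource provider, so the scheduler could pass it
  # back verbatim to the /allocations call when it claims.
  COMPUTE_RP = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'
  SHARED_DISK_RP = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'

  plan_b_response = {
      'allocation_requests': [
          {'allocations': {
              COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
              SHARED_DISK_RP: {'resources': {'DISK_GB': 20}},
          }},
      ],
      # extra information the scheduler could feed to its filters/weighers
      'provider_summaries': {
          COMPUTE_RP: {'VCPU': {'capacity': 32, 'used': 10},
                       'MEMORY_MB': {'capacity': 65536, 'used': 8192}},
          SHARED_DISK_RP: {'DISK_GB': {'capacity': 10000, 'used': 2000}},
      },
  }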
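Last, the kind of per-RP "inventories minus allocations" construct I mean above, with limited paging to bound the response size. Again, the shape is invented for this email, not a formal API proposal.

  # Invented shape: per resource provider, what is left (inventory minus
  # allocations), keyed by resource class.
  whats_left = {
      'resource_providers': [
          {'uuid': 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',   # compute node
           'available': {'VCPU': 22, 'MEMORY_MB': 57344}},
          {'uuid': 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',   # shared storage
           'available': {'DISK_GB': 8000}},
      ],
      # hypothetical limited paging to keep the result size manageable; the
      # ordering is necessarily non-deterministic since usage keeps changing
      'next': '?limit=50&marker=bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
  }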
-Sylvain
