On 05/06/2017 23:22, Ed Leafe wrote:
> We had a very lively discussion this morning during the Scheduler subteam meeting, which was continued in a Google hangout. The subject was how to handle claiming resources when the Resource Provider is not "simple". By "simple", I mean a compute node that provides all of the resources itself, as contrasted with a compute node that uses shared storage for disk space, or which has complex nested relationships with things such as PCI devices or NUMA nodes. The current situation is as follows:
>
> a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, etc.)
> b) scheduler passes these resource requirements to placement, which returns a list of hosts (compute nodes) that can satisfy the request.
> c) scheduler runs these through some filters and weighers to get a list ordered by best "fit"
> d) it then tries to claim the resources, by posting to placement allocations for these resources against the selected host
> e) once the allocation succeeds, scheduler returns that host to conductor to then have the VM built
>
> (some details for edge cases left out for clarity of the overall process)
>
> The problem we discussed comes into play when the compute node isn't the actual provider of the resources. The easiest example to consider is when the computes are associated with a shared storage provider. The placement query is smart enough to know that even if the compute node doesn't have enough local disk, it will get it from the shared storage, so it will return that host in step b) above. If the scheduler then chooses that host, when it tries to claim it, it will pass the resources and the compute node UUID back to placement to make the allocations. This is the point where the current code would fall short: somehow, placement needs to know to allocate the disk requested against the shared storage provider, and not the compute node.
>
> One proposal is to essentially use the same logic in placement that was used to include that host in those matching the requirements. In other words, when it tries to allocate the amount of disk, it would determine that that host is in a shared storage aggregate, and be smart enough to allocate against that provider. This was referred to in our discussion as "Plan A".
>
> Another proposal involved a change to how placement responds to the scheduler. Instead of just returning the UUIDs of the compute nodes that satisfy the required resources, it would include a whole bunch of additional information in a structured response. A straw man example of such a response is here: https://etherpad.openstack.org/p/placement-allocations-straw-man. This was referred to as "Plan B". The main feature of this approach is that part of that response would be the JSON dict for the allocation call, containing the specific resource provider UUID for each resource. This way, when the scheduler selects a host, it would simply pass that dict back to the /allocations call, and placement would be able to do the allocations directly against that information.
>
> There was another issue raised: simply providing the host UUIDs didn't give the scheduler enough information in order to run its filters and weighers. Since the scheduler uses those UUIDs to construct HostState objects, the specific missing information was never completely clarified, so I'm just including this aspect of the conversation for completeness.
> It is orthogonal to the question of how to allocate when the resource provider is not "simple".
>
> My current feeling is that we got ourselves into our existing mess of ugly, convoluted code when we tried to add these complex relationships into the resource tracker and the scheduler. We set out to create the placement engine to bring some sanity back to how we think about things we need to virtualize. I would really hate to see us make the same mistake again, by adding a good deal of complexity to handle a few non-simple cases. What I would like to avoid, no matter what the eventual solution chosen, is representing this complexity in multiple places. Currently the only two candidates for this logic are the placement engine, which knows about these relationships already, or the compute service itself, which has to handle the management of these complex virtualized resources.
>
> I don't know the answer. I'm hoping that we can have a discussion that might uncover a clear approach, or, at the very least, one that is less murky than the others.

I wasn't part of either the scheduler meeting or the hangout (hit by a French holiday), so I don't have all the details in mind and I could be making wrong assumptions; I apologize in advance if I say anything silly. That said, I still have some opinions and I'll put them here. Thanks for bringing that problem up here, Ed.

The intent of the scheduler is to pick a destination where an instance can land (the old 'can_host' concept). Getting back a list of "sharing RPs" (a shared volume, say) doesn't really help the decision-making, I feel. What a user could want is for the scheduler to pick a destination that is *close* to a given "sharing RP", or one that is *not* shared with that "sharing RP", but I don't see the need for us to return the list of "things-that-cannot-host" (i.e. "sharing RPs") when the scheduler asks Placement for the list of potential targets.

That said, in order to make scheduling decisions based on filters (like a dummy OnlyLocalDiskFilter or a NetworkSegmentedOnlyFilter) or weighers (a PreferMeLocalDisksWeigher), we could imagine the construct returned by Placement for the resource classes it handles containing more than just RP UUIDs: a list of extended dictionaries (one per Resource Provider) of "inventories minus allocations" (i.e. what's left in the cloud), keyed by resource class. Of course, the size of the result could be a problem; couldn't we imagine limited paging for that? Of course, the ordering is non-deterministic, since the construction of that list depends on what is available at the time.

The Plan A option you mention hides the complexity of the shared/non-shared logic, but at the price of making scheduling decisions on those criteria impossible unless you put filtering/weighing logic into Placement, which AFAIK we strongly disagree with.

To make the above a bit more concrete, a few rough sketches follow below (all names, UUIDs and payload shapes in them are made up).
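First, to make sure we are all picturing the same flow, here is a very rough Python-ish sketch of steps a) to e) that Ed described. Every name in it is hypothetical; this is not the actual nova scheduler code path, just an illustration of the sequence.

  # Very rough sketch of steps a) to e) above. Every name here is
  # hypothetical -- this is not the actual nova scheduler code.
  def schedule_and_claim(request_spec, placement, filters, weighers, conductor):
      # a) resource requirements extracted from the request
      resources = {'VCPU': request_spec['vcpus'],
                   'MEMORY_MB': request_spec['memory_mb'],
                   'DISK_GB': request_spec['disk_gb']}

      # b) ask placement which compute node RPs can satisfy them
      candidates = placement.get_providers(resources)

      # c) run filters and weighers to order the candidates by best "fit"
      for host_filter in filters:
          candidates = [c for c in candidates if host_filter(c, request_spec)]
      candidates.sort(key=lambda c: sum(w(c, request_spec) for w in weighers),
                      reverse=True)

      # d) try to claim by writing allocations for these resources against
      #    the selected host's resource provider
      for host in candidates:
          claimed = placement.put_allocations(
              request_spec['instance_uuid'],
              {host['uuid']: {'resources': resources}})
          if claimed:
              # e) hand the selected host to the conductor to build the VM
              return conductor.build(request_spec, host)

      raise RuntimeError('No valid host was found')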
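Second, the shared-storage problem at claim time. The UUIDs below are made up and the payload shape only approximates the allocations format, but it shows what the scheduler writes today versus what actually needs to end up recorded, and what "Plan A" would ask placement to infer on its own.

  # Made-up UUIDs; payload shape approximates the allocations format.
  COMPUTE_RP = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'      # compute node RP
  SHARED_DISK_RP = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'  # shared storage RP,
                                                           # same aggregate

  # What the scheduler claims today: everything against the compute node,
  # which is wrong when the disk really comes from the shared storage pool.
  naive_allocation = {
      COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048, 'DISK_GB': 20}},
  }

  # What needs to be recorded: DISK_GB against the sharing provider.
  # Under "Plan A", placement itself would notice the aggregate
  # relationship and rewrite the naive request into this at allocation time.
  correct_allocation = {
      COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
      SHARED_DISK_RP: {'resources': {'DISK_GB': 20}},
  }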
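Third, "Plan B". The etherpad has the actual straw man; the snippet below is only my guess at the *shape* of such a structured response, to keep the discussion concrete. The key names and numbers are invented, not copied from the etherpad.

  # Invented example of a "Plan B"-style response: the allocation dict is
  # already split per resource provider, so the scheduler could pass it
  # back verbatim to the /allocations call when it claims.
  COMPUTE_RP = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'
  SHARED_DISK_RP = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'

  plan_b_response = {
      'allocation_requests': [
          {'allocations': {
              COMPUTE_RP: {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
              SHARED_DISK_RP: {'resources': {'DISK_GB': 20}},
          }},
      ],
      # extra information the scheduler could feed to its filters/weighers
      'provider_summaries': {
          COMPUTE_RP: {'VCPU': {'capacity': 32, 'used': 10},
                       'MEMORY_MB': {'capacity': 65536, 'used': 8192}},
          SHARED_DISK_RP: {'DISK_GB': {'capacity': 10000, 'used': 2000}},
      },
  }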
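Last, the kind of per-RP "inventories minus allocations" construct I mean above, with limited paging to bound the response size. Again, the shape is invented for this email, not a formal API proposal.

  # Invented shape: per resource provider, what is left (inventory minus
  # allocations), keyed by resource class.
  whats_left = {
      'resource_providers': [
          {'uuid': 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',   # compute node
           'available': {'VCPU': 22, 'MEMORY_MB': 57344}},
          {'uuid': 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',   # shared storage
           'available': {'DISK_GB': 8000}},
      ],
      # hypothetical limited paging to keep the result size manageable; the
      # ordering is necessarily non-deterministic since usage keeps changing
      'next': '?limit=50&marker=bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
  }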
-Sylvain
