There has been much discussion. We've arrived at an initial proposal and are ready for more (hopefully smaller, hopefully conclusive) discussion.
To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at 1500
UTC. Be in #openstack-placement to get the link to join.

The strawpeople outlined below and discussed in the referenced etherpad
have been consolidated/distilled into a new etherpad [1], around which
the hangout discussion will be centered.

[1] https://etherpad.openstack.org/p/placement-making-the-(up)grade

Thanks,
efried

On 06/01/2018 01:12 PM, Jay Pipes wrote:
> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>> 1. Make everything perform the pivot on compute node start (which can
>>> be re-used by a CLI tool for the offline case)
>>> 2. Make everything default to non-nested inventory at first, and
>>> provide a way to migrate a compute node and its instances one at a
>>> time (in place) to roll through.
>>
>> I agree that it sure would be nice to do ^ rather than requiring the
>> "slide puzzle" thing.
>>
>> But how would this be accomplished, in light of the current "separation
>> of responsibilities" drawn at the virt driver interface, whereby the
>> virt driver isn't supposed to talk to placement directly, or know
>> anything about allocations?
>
> FWIW, I don't have a problem with the virt driver "knowing about
> allocations". What I have a problem with is the virt driver *claiming
> resources for an instance*.
>
> That's what the whole placement claims-resources thing was all about,
> and I'm not interested in stepping back to the days of long, racy claim
> operations by having the compute nodes be responsible for claiming
> resources.
>
> That said, once the consumer generation microversion lands [1], it
> should be possible to *safely* modify an allocation set for a consumer
> (instance) and move allocation records for an instance from one
> provider to another.
>
> [1] https://review.openstack.org/#/c/565604/
>
>> Here's a first pass:
>>
>> The virt driver, via the return value from update_provider_tree, tells
>> the resource tracker that "inventory of resource class A on provider B
>> has moved to provider C" for all applicable AxBxC. E.g.
>>
>> [ { 'from_resource_provider': <cn_rp_uuid>,
>>     'moved_resources': [VGPU: 4],
>>     'to_resource_provider': <gpu_rp1_uuid>
>>   },
>>   { 'from_resource_provider': <cn_rp_uuid>,
>>     'moved_resources': [VGPU: 4],
>>     'to_resource_provider': <gpu_rp2_uuid>
>>   },
>>   { 'from_resource_provider': <cn_rp_uuid>,
>>     'moved_resources': [
>>         SRIOV_NET_VF: 2,
>>         NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>>         NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>>     ],
>>     'to_resource_provider': <gpu_rp2_uuid>
>>   }
>> ]
>>
>> As today, the resource tracker takes the updated provider tree and
>> invokes [1] the report client method update_from_provider_tree [2] to
>> flush the changes to placement. But now update_from_provider_tree also
>> accepts the return value from update_provider_tree and, for each
>> "move":
>>
>> - Creates provider C (as described in the provider_tree) if it doesn't
>>   already exist.
>> - Creates/updates provider C's inventory as described in the
>>   provider_tree (without yet updating provider B's inventory). This
>>   ought to create the inventory of resource class A on provider C.
>
> Unfortunately, right here you'll introduce a race condition. As soon as
> this operation completes, the scheduler will have the ability to throw
> new instances on provider C and consume the inventory from it that you
> intend to give to the existing instance that is consuming from
> provider B.
>
>> - Discovers allocations of rc A on rp B and POSTs to move them to
>>   rp C*.
>
> For each consumer of resources on rp B, right?
>
>> - Updates provider B's inventory.
>
> Again, this is problematic because the scheduler will have already
> begun to place new instances on B's inventory, which could very well
> result in incorrect resource accounting on the node.
>
> We basically need to have one giant new REST API call that accepts the
> list of "move instructions" and performs all of the instructions in a
> single transaction. :(
>
>> (*There's a hole here: if we're splitting a glommed-together inventory
>> across multiple new child providers, as with the VGPUs in the example,
>> we don't know which allocations to put where. The virt driver should
>> know which instances own which specific inventory units, and would be
>> able to report that info within the data structure. That's getting
>> kinda close to the virt driver mucking with allocations, but maybe it
>> fits well enough into this model to be acceptable?)
>
> Well, it's not really the virt driver *itself* mucking with the
> allocations. It's more that the virt driver is telling something *else*
> the move instructions that it feels are needed...
>
>> Note that the return value from update_provider_tree is optional, and
>> only used when the virt driver is indicating a "move" of this ilk. If
>> it's None/[], then the RT/update_from_provider_tree flow is the same
>> as it is today.
>>
>> If we can do it this way, we don't need a migration tool. In fact, we
>> don't even need to restrict provider tree "reshaping" to release
>> boundaries. As long as the virt driver understands its own data model
>> migrations and reports them properly via update_provider_tree, it can
>> shuffle its tree around whenever it wants.
>
> Due to the many race conditions we would have in trying to fudge
> inventory amounts (the reserved/total thing) and allocation movement
> for more than one consumer at a time, I'm pretty sure the only safe
> thing to do is have a single new HTTP endpoint that would take this
> list of move operations and perform them atomically (on the placement
> server side, of course).
>
> Here's a strawman for how that HTTP endpoint might look:
>
> https://etherpad.openstack.org/p/placement-migrate-operations
>
> Feel free to mark up and destroy.
>
> Best,
> -jay
>
>> Thoughts?
>>
>> -efried
>>
>> [1]
>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
>>
>> [2]
>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
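
P.S. To make the data shapes concrete for the hangout, here's a rough,
untested Python sketch of the per-consumer piece of the "move" flow
quoted above -- i.e. what "discovers allocations of rc A on rp B and
moves them to rp C" would do to a single instance's allocation set. The
function name and the simplified allocation format are made up for
illustration; this is not actual report client code.

# Illustration only -- not real nova/placement code.  One consumer's
# allocations are taken to look roughly like the placement API form:
#   {rp_uuid: {'resources': {resource_class: amount}}}

def retarget_allocations(allocations, moved_classes, from_rp, to_rp):
    """Return a copy of one consumer's allocations with the named
    resource classes moved from `from_rp` to `to_rp`."""
    new = {rp: {'resources': dict(res['resources'])}
           for rp, res in allocations.items()}
    src = new.get(from_rp, {}).get('resources', {})
    for rc in moved_classes:
        if rc in src:
            amount = src.pop(rc)
            dst = new.setdefault(to_rp, {'resources': {}})['resources']
            dst[rc] = dst.get(rc, 0) + amount
    if not src and from_rp in new:
        # Nothing left on the old provider; drop the empty entry.
        del new[from_rp]
    return new


# One "move instruction" of the kind update_provider_tree would return:
move = {'from_resource_provider': 'cn_rp_uuid',
        'moved_resources': ['VGPU'],
        'to_resource_provider': 'gpu_rp1_uuid'}

# An instance currently holding 2 VGPU (plus CPU/RAM) on the compute
# node provider:
before = {'cn_rp_uuid': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192,
                                       'VGPU': 2}}}

after = retarget_allocations(before, move['moved_resources'],
                             move['from_resource_provider'],
                             move['to_resource_provider'])
# after == {'cn_rp_uuid': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192}},
#           'gpu_rp1_uuid': {'resources': {'VGPU': 2}}}

Note this is exactly where the hole I mention above shows up: if the
compute node's VGPU inventory is being split across gpu_rp1 and gpu_rp2,
nothing in the move instruction says which child this instance's 2 VGPU
belong on -- only the virt driver knows that.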
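
P.P.S. And purely as a thought experiment on the single-transaction idea
(the real strawman is in Jay's etherpad and may look nothing like this),
one possible shape for the body of such a call, written as a Python dict:

# Hypothetical request body for an atomic "apply all the moves" call --
# a guess for discussion, not what the etherpad strawman actually says.
# Placement would get the replacement inventory for every affected
# provider and the rewritten allocations for every affected consumer in
# one shot, and apply them in a single transaction.
reshape_body = {
    'inventories': {
        # The compute node keeps CPU/RAM but no longer exposes VGPU.
        'cn_rp_uuid': {'VCPU': {'total': 48},
                       'MEMORY_MB': {'total': 131072}},
        # The new child providers take over the VGPU inventory.
        'gpu_rp1_uuid': {'VGPU': {'total': 4}},
        'gpu_rp2_uuid': {'VGPU': {'total': 4}},
    },
    'allocations': {
        # Each existing consumer's full allocation set, retargeted as in
        # retarget_allocations() above.
        'instance_a_uuid': {
            'cn_rp_uuid': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192}},
            'gpu_rp1_uuid': {'resources': {'VGPU': 2}},
        },
    },
}

The point being that placement could validate and apply the whole thing
under one set of generation checks, so the scheduler never sees a
half-reshaped tree.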