Re: [openstack-dev] vGPUs support for Nova

Matt Riedemann Mon, 25 Sep 2017 07:31:01 -0700

On 9/25/2017 5:40 AM, Jay Pipes wrote:

On 09/25/2017 05:39 AM, Sahid Orentino Ferdjaoui wrote:

There is a desire to expose vGPU resources on top of Resource
Providers, which is probably the path we should be going in the long
term. I was not there for the last PTG and you probably already made a
decision about moving in that direction anyway. My personal feeling is
that it is premature.


The nested Resource Provider work is not yet feature-complete and
requires more reviewer attention. If we continue in the direction of
Resource Provider, it will need at least 2 more releases to expose the
vGPUs feature and that without the support of NUMA, and with the
feeling of pushing something which is not stable/production-ready.

It seems safer to first have the Resource Provider work well
finalized/stabilized to be production-ready. Then on top of something
stable we could start to migrate our current virt specific features
like NUMA, CPU Pinning, Huge Pages and finally PCI devices.

I'm talking about PCI devices in general because I think we should
implement the vGPU on top of our /pci framework which is production
ready and provides the support of NUMA.

The hardware vendors are building their drivers using mdev, and the
/pci framework currently understands only SR-IOV, but at a quick glance
it does not seem complicated to make it support mdev.

In the /pci framework we will have to:

* Update the PciDevice object fields to accept a NULL value for
   'address' and add a new field 'uuid'
* Update PciRequest to handle a new tag like 'vgpu_types'
* Update PciDeviceStats to also maintain a pool of vGPUs

The operators will have to create alias(es) and configure
flavors. Basically most of the logic is already implemented and the
method 'consume_request' is going to select the right vGPUs according
to the request.
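
As a rough illustration of the /pci changes listed above, here is a minimal, self-contained sketch. It is not Nova code: the classes Device and Pools are hypothetical stand-ins for PciDevice and PciDeviceStats, the type names and uuids are invented, and the selection rule (first requested type with enough free devices wins) is only an assumption about how such a pool might behave.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Device:
    """Hypothetical stand-in for a PciDevice record.

    'address' may be None for an mdev-backed vGPU, which is then
    identified by 'uuid' instead (the two field changes proposed above).
    """
    vgpu_type: str
    address: Optional[str] = None
    uuid: Optional[str] = None


@dataclass
class Pools:
    """Hypothetical stand-in for PciDeviceStats: free devices per type."""
    free: Dict[str, List[Device]] = field(default_factory=dict)

    def add(self, dev: Device) -> None:
        self.free.setdefault(dev.vgpu_type, []).append(dev)

    def consume_request(self, vgpu_types: List[str],
                        count: int = 1) -> Optional[List[Device]]:
        """Claim 'count' devices from the first requested type that has
        enough free devices. Return None, leaving the pools untouched,
        if no requested type can satisfy the request."""
        for vgpu_type in vgpu_types:
            pool = self.free.get(vgpu_type, [])
            if len(pool) >= count:
                taken, self.free[vgpu_type] = pool[:count], pool[count:]
                return taken
        return None


# Example: two vGPUs of one (made-up) type and one of another.
pools = Pools()
pools.add(Device(vgpu_type="type-a", uuid="uuid-1"))
pools.add(Device(vgpu_type="type-a", uuid="uuid-2"))
pools.add(Device(vgpu_type="type-b", uuid="uuid-3"))

claimed = pools.consume_request(["type-a"], count=2)
exhausted = pools.consume_request(["type-a"])
fallback = pools.consume_request(["type-c", "type-b"])
```

In this sketch the 'vgpu_types' tag of the request is simply an ordered list of acceptable types; how an alias and a flavor would map onto that list is left out.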

In /virt we will have to:

* Update the field 'pci_passthrough_devices' to also include GPU
   devices.
* Update attach/detach PCI device to handle vGPUs

We have a few people interested in working on it, so we could
certainly make this feature available for Queens.

I can take the lead updating/implementing the PCI and libvirt driver
part, and I'm sure Jianghua Wang will be happy to take the lead for
the virt XenServer part.

And I trust Jay, Stephen and Sylvain to follow the developments.

I understand the desire to get something into Nova to support vGPUs, and I understand that the existing /pci modules represent the fastest/cheapest way to get there.

I won't block you from making any of the above changes, Sahid. I'll even do my best to review them. However, I will be primarily focusing this cycle on getting the nested resource providers work feature-complete for (at least) SR-IOV PF/VF devices.

The decision of whether to allow an approach that adds more to the existing /pci module is ultimately Matt's.


Best,
-jay

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Nested resource providers is not merged or production ready because we haven't made it a priority. We've certainly talked about it and Jay has had patches proposed for several releases now though.

Building vGPU support into the existing framework, which only a couple of people understand (certainly not me), might be a short-term gain, but it is just more technical debt we have to pay off later, and it delays any focus on nested resource providers for the wider team.

At the Queens PTG it was abundantly clear that many features are dependent on nested resource providers, including several networking-related features like bandwidth-based scheduling.


The priorities for placement/scheduler in Queens are:

1. Dan Smith's migration allocations cleanup.
2. Alternative hosts for reschedules with cells v2.
3. Nested resource providers.

All of these are in progress and need review.

I personally don't think we should abandon the plan to implement vGPU support with nested resource providers without first seeing any code changes for it as a proof of concept. It also sounds like we have a pretty simple staggered plan for rolling out vGPU support, so it's not very detailed to start. The virt driver reports vGPU inventory and we decorate the details later with traits (which Alex Xu is working on and needs review).
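
To make the staggered plan above a bit more concrete, here is a minimal sketch of a compute node with child providers that report vGPU inventory, later decorated with traits. It is a toy model, not the placement API: the Provider class and find_providers() helper are hypothetical, the trait names are invented (placement does require custom traits to start with CUSTOM_), and the inventory numbers are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass
class Provider:
    """Hypothetical resource provider node in a provider tree."""
    name: str
    inventory: Dict[str, int] = field(default_factory=dict)
    traits: Set[str] = field(default_factory=set)
    children: List["Provider"] = field(default_factory=list)


def find_providers(root: Provider, resource_class: str, amount: int,
                   required_traits: FrozenSet[str] = frozenset()) -> List[str]:
    """Walk the tree and return the sorted names of providers that hold
    at least 'amount' of 'resource_class' and carry every required trait."""
    matches = []
    stack = [root]
    while stack:
        provider = stack.pop()
        has_inventory = provider.inventory.get(resource_class, 0) >= amount
        if has_inventory and required_traits <= provider.traits:
            matches.append(provider.name)
        stack.extend(provider.children)
    return sorted(matches)


# Step 1: the virt driver reports plain vGPU inventory per physical GPU.
pgpu0 = Provider("pgpu0", inventory={"VGPU": 4})
pgpu1 = Provider("pgpu1", inventory={"VGPU": 2})
node = Provider("compute1", inventory={"VCPU": 16, "MEMORY_MB": 65536},
                children=[pgpu0, pgpu1])

any_vgpu = find_providers(node, "VGPU", 1)

# Step 2: details are added later as traits (names invented here).
pgpu0.traits.add("CUSTOM_VGPU_TYPE_A")
pgpu1.traits.add("CUSTOM_VGPU_TYPE_B")

type_b_only = find_providers(node, "VGPU", 1,
                             frozenset({"CUSTOM_VGPU_TYPE_B"}))
```

The point of the sketch is only that a request for plain VGPU inventory keeps working unchanged once traits are added, which is what makes the staggered rollout possible.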

Sahid, you could certainly implement a separate proof of concept and make that available. If the nested resource providers-based change hits major issues, takes far too long, or carries too much risk, then we have a contingency plan at least. But I don't expect that to get review priority, and you'd have to accept that it might not get merged since we want to use nested resource providers.

Either way we are going to need solid functional testing, and that functional testing should be written against the API as much as possible so that it works regardless of the backend implementation of the feature. One of the big things we failed at in Pike was not doing enough functional testing of move operations with claims in the scheduler earlier in the cycle. That all came in late and we're still fixing bugs as a result.

If we can get started early on the functional testing for vGPUs, then work both implementations in parallel, we should be able to retain the functional tests and determine which implementation we ultimately need to go with, probably sometime in the second milestone.
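
A minimal sketch of what "tests written against the API, independent of the backend" could look like follows. Everything in it is hypothetical: PciBackend and PlacementBackend are toy trackers, not the real /pci or nested resource provider code, and boot_server() stands in for a real API call that returns a server status.

```python
from typing import Callable, List


class PciBackend:
    """Hypothetical backend A: tracks free vGPUs as a list of slots."""

    def __init__(self, total: int) -> None:
        self._free = list(range(total))

    def claim(self) -> bool:
        if not self._free:
            return False
        self._free.pop()
        return True


class PlacementBackend:
    """Hypothetical backend B: tracks total and used counts."""

    def __init__(self, total: int) -> None:
        self._total = total
        self._used = 0

    def claim(self) -> bool:
        if self._used >= self._total:
            return False
        self._used += 1
        return True


def boot_server(backend) -> str:
    """Toy 'API' call: boot a server that needs one vGPU."""
    return "ACTIVE" if backend.claim() else "ERROR"


def check_capacity_is_enforced(make_backend: Callable[[int], object]) -> List[str]:
    """The functional check itself only talks to boot_server(), so it
    does not care which backend does the accounting."""
    backend = make_backend(2)
    return [boot_server(backend) for _ in range(3)]


# The same check runs unchanged against both implementations.
results = {cls.__name__: check_capacity_is_enforced(cls)
           for cls in (PciBackend, PlacementBackend)}
```

Because the check only observes API-level results, it can be kept whichever implementation is eventually chosen.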


--

Thanks,

Matt
