Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, I hope that you guys are in agreement on this. But take a look at the wiki: https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support and see if it has any difference from your proposals. IMO, it's the critical piece of the proposal, and hasn't been specified in exact terms yet. I'm not sure about vif_attributes or vif_stats, which I just heard from you. In any case, I'm not convinced by the flexibility and/or complexity, and so far I haven't seen a use case that really demands it. But I'd be happy to see one. thanks, Robert On 1/29/14 4:43 PM, Ian Wells ijw.ubu...@cack.org.uk wrote: My proposals: On 29 January 2014 16:43, Robert Li (baoli) ba...@cisco.com wrote: 1. pci-flavor-attrs is configured through configuration files and will be available on both the controller node and the compute nodes. Can the cloud admin decide to add a new attribute in a running cloud? If that's possible, how is that done? When nova-compute starts up, it requests the VIF attributes that the schedulers need. (You could have multiple schedulers; they could be in disagreement; it picks the last answer.) It returns pci_stats by the selected combination of VIF attributes. When nova-scheduler starts up, it sends an unsolicited cast of the attributes. nova-compute updates the attributes, clears its pci_stats and recreates them. If nova-scheduler receives pci_stats with incorrect attributes it discards them. (There is a row from nova-compute summarising devices for each unique combination of vif_stats, including 'None' where no attribute is set.) I'm assuming here that the pci_flavor_attrs are read on startup of nova-scheduler and could be re-read and different when nova-scheduler is reset. There's a relatively straightforward move from here to an API for setting it if this turns out to be useful, but firstly I think it would be an uncommon occurrence and secondly it's not something we should implement now. 2. PCI flavor will be defined using the attributes in pci-flavor-attrs. A flavor is defined with a matching expression in the form of attr1 = val11 [| val12 ...], [attr2 = val21 [| val22 ...]], ... And this expression is used to match one or more PCI stats groups until a free PCI device is located. In this case, both attr1 and attr2 can have multiple values, and both attributes need to be satisfied. Please confirm this understanding is correct. This looks right to me as we've discussed it, but I think we'll be wanting something that allows a top level AND. In the above example, I can't say an Intel NIC and a Mellanox NIC are equally OK, because I can't say (intel + product ID 1) AND (Mellanox + product ID 2). I'll leave Yunhong to decide how the details should look, though. 3. I'd like to see an example that involves multiple attributes. Let's say pci-flavor-attrs = {gpu, net-group, device_id, product_id}. I'd like to know how PCI stats groups are formed on compute nodes based on that, and how many PCI stats groups there are. What are the reasonable guidelines for defining the PCI flavors? I need to write up the document for this, and it's overdue. Leave it with me. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
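A purely illustrative sketch for question 3 (the attribute names and values here are invented, and the exact behaviour is what the wiki still needs to pin down): with pci_flavor_attrs = {vendor_id, product_id, e.physical_network}, a compute node would report one pci_stats pool per distinct combination of those attribute values found on its devices, so the number of pools is bounded by the number of distinct combinations rather than by the number of devices. Roughly:

    from collections import Counter

    # Hypothetical attribute set, as if configured via pci_flavor_attrs.
    pci_flavor_attrs = ['vendor_id', 'product_id', 'e.physical_network']

    # Hypothetical devices discovered on one compute node.
    devices = [
        {'address': '0000:06:10.1', 'vendor_id': '8086', 'product_id': '10ed',
         'e.physical_network': 'phy1'},
        {'address': '0000:06:10.3', 'vendor_id': '8086', 'product_id': '10ed',
         'e.physical_network': 'phy1'},
        {'address': '0000:07:10.1', 'vendor_id': '15b3', 'product_id': '1004',
         'e.physical_network': 'phy2'},
        # A GPU with no networking attribute set at all.
        {'address': '0000:08:00.0', 'vendor_id': '10de', 'product_id': '1db4'},
    ]

    def pool_key(dev):
        # Devices missing an attribute fall into a 'None' bucket, matching the
        # "'None' where no attribute is set" behaviour described above.
        return tuple(dev.get(attr) for attr in pci_flavor_attrs)

    pools = Counter(pool_key(d) for d in devices)
    for key, count in pools.items():
        print(dict(zip(pci_flavor_attrs, key)), 'count =', count)

Here four devices collapse into three pools; a flavor's matching expression would then be evaluated against pool keys like these, not against individual devices.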
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Yongli, Thank you for addressing my comments, and for adding the encryption card use case. One thing that I want to point out is that in this use case, you may not use the pci-flavor in the --nic option because it's not a neutron feature. I have a few more questions: 1. pci-flavor-attrs is configured through configuration files and will be available on both the controller node and the compute nodes. Can the cloud admin decide to add a new attribute in a running cloud? If that's possible, how is that done? 2. PCI flavor will be defined using the attributes in pci-flavor-attrs. A flavor is defined with a matching expression in the form of attr1 = val11 [| val12 ...], [attr2 = val21 [| val22 ...]], ... And this expression is used to match one or more PCI stats groups until a free PCI device is located. In this case, both attr1 and attr2 can have multiple values, and both attributes need to be satisfied. Please confirm this understanding is correct. 3. I'd like to see an example that involves multiple attributes. Let's say pci-flavor-attrs = {gpu, net-group, device_id, product_id}. I'd like to know how PCI stats groups are formed on compute nodes based on that, and how many PCI stats groups there are. What are the reasonable guidelines for defining the PCI flavors? thanks, Robert On 1/28/14 10:16 PM, Robert Li (baoli) ba...@cisco.com wrote: Hi, I added a few comments in this wiki that Yongli came up with: https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support Please check it out and look for Robert in the wiki. Thanks, Robert On 1/21/14 9:55 AM, Robert Li (baoli) ba...@cisco.com wrote: Yunhong, Just trying to understand your use case: -- a VM can only work with cards from vendor V1 -- a VM can work with cards from both vendor V1 and V2 So stats in the two flavors will overlap in the PCI flavor solution. I'm just trying to say that this is something that needs to be properly addressed. Just for the sake of discussion, another solution to meeting the above requirement is to be able to say in the nova flavor's extra-spec encrypt_card = card from vendor V1 OR encrypt_card = card from vendor V2. In other words, this can be solved in the nova flavor, rather than introducing a new flavor. Thanks, Robert On 1/17/14 7:03 PM, yunhong jiang yunhong.ji...@linux.intel.com wrote: On Fri, 2014-01-17 at 22:30 +, Robert Li (baoli) wrote: Yunhong, I'm hoping that these comments can be directly addressed: a practical deployment scenario that requires arbitrary attributes. I'm just strongly against supporting only one attribute (your PCI group) for scheduling and management; that's really TOO limited. A simple scenario is, I have 3 encryption cards: Card 1 (vendor_id is V1, device_id = 0xa), Card 2 (vendor_id is V1, device_id = 0xb), Card 3 (vendor_id is V2, device_id = 0xb). I have two images. One image only supports Card 1 and another image supports Card 1/3 (or any other combination of the 3 card types). I don't think only one attribute will meet such a requirement. As to arbitrary attributes or a limited list of attributes, my opinion is, as there are so many types of PCI devices and so many potential PCI device usages, supporting arbitrary attributes will make our effort more flexible, if we can push the implementation into the tree. detailed design on the following (that also takes into account the introduction of predefined attributes): * PCI stats report since the scheduler is stats based I don't think there is much difference from the current implementation.
* the scheduler in support of PCI flavors with arbitrary attributes and potential overlapping. As Ian said, we need to make sure the pci_stats and the PCI flavor have the same set of attributes, so I don't think there is much difference from the current implementation. networking requirements to support multiple provider nets/physical nets Can't the extra info resolve this issue? Can you elaborate on the issue? Thanks --jyh I guess that the above will become clear as the discussion goes on. And we also need to define the deliverables. Thanks, Robert ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
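To make the question-2 syntax concrete (values invented for illustration only): a flavor written as

    vendor_id = 8086, product_id = 10ed | 1515

would match any stats group whose vendor_id is 8086 and whose product_id is either 10ed or 1515 - every listed attribute must be satisfied, and within one attribute any one of its listed values is enough.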
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
My proposals: On 29 January 2014 16:43, Robert Li (baoli) ba...@cisco.com wrote: 1. pci-flavor-attrs is configured through configuration files and will be available on both the controller node and the compute nodes. Can the cloud admin decide to add a new attribute in a running cloud? If that's possible, how is that done? When nova-compute starts up, it requests the VIF attributes that the schedulers need. (You could have multiple schedulers; they could be in disagreement; it picks the last answer.) It returns pci_stats by the selected combination of VIF attributes. When nova-scheduler starts up, it sends an unsolicited cast of the attributes. nova-compute updates the attributes, clears its pci_stats and recreates them. If nova-scheduler receives pci_stats with incorrect attributes it discards them. (There is a row from nova-compute summarising devices for each unique combination of vif_stats, including 'None' where no attribute is set.) I'm assuming here that the pci_flavor_attrs are read on startup of nova-scheduler and could be re-read and different when nova-scheduler is reset. There's a relatively straightforward move from here to an API for setting it if this turns out to be useful, but firstly I think it would be an uncommon occurrence and secondly it's not something we should implement now. 2. PCI flavor will be defined using the attributes in pci-flavor-attrs. A flavor is defined with a matching expression in the form of attr1 = val11 [| val12 ...], [attr2 = val21 [| val22 ...]], ... And this expression is used to match one or more PCI stats groups until a free PCI device is located. In this case, both attr1 and attr2 can have multiple values, and both attributes need to be satisfied. Please confirm this understanding is correct. This looks right to me as we've discussed it, but I think we'll be wanting something that allows a top level AND. In the above example, I can't say an Intel NIC and a Mellanox NIC are equally OK, because I can't say (intel + product ID 1) AND (Mellanox + product ID 2). I'll leave Yunhong to decide how the details should look, though. 3. I'd like to see an example that involves multiple attributes. Let's say pci-flavor-attrs = {gpu, net-group, device_id, product_id}. I'd like to know how PCI stats groups are formed on compute nodes based on that, and how many PCI stats groups there are. What are the reasonable guidelines for defining the PCI flavors? I need to write up the document for this, and it's overdue. Leave it with me. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
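To spell out the limitation Ian mentions, with invented product IDs: a single expression such as

    vendor_id = intel | mellanox, product_id = 1 | 2

also matches (intel, product 2) and (mellanox, product 1), because values are ORed per attribute and the attributes are then ANDed independently of each other. A flavor meaning 'an Intel NIC with product ID 1, or a Mellanox NIC with product ID 2' therefore needs some way to combine two complete sub-expressions at the top level - for example a flavor holding a list of expressions, any one of which may match.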
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Document updated to talk about network aware scheduling ( https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit# - section just before the use case list). Based on yesterday's meeting, rkukura would also like to see network-aware scheduling work for non-PCI cases - where servers are not necessarily connected to every physical segment and machines therefore need placing based on where they can reach the networks they need. I think this is an exact parallel to the PCI case, except that we're also constrained by a count of resources (you can connect an infinite number of VMs to a software bridge, of course). We should implement the scheduling changes as a separate batch of work that solves both problems, if we can - and this works with the two step approach, because step 1 brings us up to Neutron parity and step 2 will add network-aware scheduling for both PCI and non-PCI cases. -- Ian. On 20 January 2014 13:38, Ian Wells ijw.ubu...@cack.org.uk wrote: On 20 January 2014 09:28, Irena Berezovsky ire...@mellanox.com wrote: Hi, Having had a post-PCI-meeting discussion with Ian based on his proposal https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1# , I am not sure that the case that is quite usable for SR-IOV based networking is covered well by this proposal. The understanding I got is that a VM can land on a Host that lacks a suitable PCI resource. The issue we have is if we have multiple underlying networks in the system and only some Neutron networks are trunked on the network that the PCI device is attached to. This can specifically happen in the case of provider versus trunk networks, though it's very dependent on the setup of your system. The issue is that, in the design we have, Neutron at present has no input into scheduling, and also that all devices in a flavor are precisely equivalent. So if I say 'I want a 10G card attached to network X' I will get one of the cards in the 10G flavor with no regard as to whether it can actually attach to network X. I can see two options here: 1. What I'd do right now is I would make it so that a VM that is given an unsuitable network card fails to run in nova-compute when Neutron discovers it can't attach the PCI device to the network. This will get us a lot of use cases and a Neutron driver without solving the problem elegantly. You'd need to choose e.g. a provider or tenant network flavor, mindful of the network you're connecting to, so that Neutron can actually succeed, which is more visibility into the workings of Neutron than the user really ought to need. 2. When Nova checks that all the networks exist - which, conveniently, is in nova-api - it also gets attributes from the networks that can be used by the scheduler to choose a device. So the scheduler chooses from a flavor *and*, within that flavor, from a subset of those devices with appropriate connectivity. If we do this then the Neutron connection code doesn't change - it should still fail if the connection can't be made - but it becomes an internal error, since it's now an issue of consistency of setup. To do this, I think we would tell Neutron 'PCI extra-info X should be set to Y for this provider network and Z for tenant networks' - the precise implementation would be somewhat up to the driver - and then add the additional check in the scheduler. The scheduling attributes list would have to include that attribute. 
Can you please provide an example for the required cloud admin PCI related configurations on nova-compute and controller node with regards to the following simplified scenario: -- There are 2 provider networks (phy1, phy2), each one has associated range on vlan-ids -- Each compute node has 2 vendor adapters with SR-IOV enabled feature, exposing xx Virtual Functions. -- Every VM vnic on virtual network on provider network phy1 or phy2 should be pci pass-through vnic. So, we would configure Neutron to check the 'e.physical_network' attribute on connection and to return it as a requirement on networks. Any PCI on provider network 'phy1' would be tagged e.physical_network = 'phy1'. When returning the network, an extra attribute would be supplied (perhaps something like 'pci_requirements = { e.physical_network = 'phy1'}'. And nova-api would know that, in the case of macvtap and PCI directmap, it would need to pass this additional information to the scheduler which would need to make use of it in finding a device, over and above the flavor requirements. Neutron, when mapping a PCI port, would similarly work out from the Neutron network the trunk it needs to connect to, and would reject any mapping that didn't conform. If it did, it would work out how to encapsulate the traffic from the PCI device and set that up on the PF of the port. I'm not saying this is the only or best solution, but it does have the advantage that it keeps all of
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Just one comment: The devices allocated for an instance are immediately known after the domain is created. Therefore it's possible to do a port update and have the device configured while the instance is booting. --Robert On 1/19/14 2:15 AM, Irena Berezovsky ire...@mellanox.com wrote: Hi Robert, Yonhong, Although network XML solution (option 1) is very elegant, it has one major disadvantage. As Robert mentioned, the disadvantage of the network XML is the inability to know what SR-IOV PCI device was actually allocated. When neutron is responsible to set networking configuration, manage admin status, set security groups, it should be able to identify the SR-IOV PCI device to apply configuration. Within current libvirt Network XML implementation, it does not seem possible. Between option (2) and (3), I do not have any preference, it should be as simple as possible. Option (3) that I raised can be achieved by renaming the network interface of Virtual Function via 'ip link set name'. Interface logical name can be based on neutron port UUID. This will allow neutron to discover devices, if backend plugin requires it. Once VM is migrating, suitable Virtual Function on the target node should be allocated, and then its corresponding network interface should be renamed to same logical name. This can be done without system rebooting. Still need to check how the Virtual Function corresponding network interface can be returned to its original name once is not used anymore as VM vNIC. Regards, Irena -Original Message- From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Friday, January 17, 2014 9:06 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, thanks for your long reply. Personally I'd prefer option 2/3 as it keep Nova the only entity for PCI management. Glad you are ok with Ian's proposal and we have solution to resolve the libvirt network scenario in that framework. Thanks --jyh -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, January 17, 2014 7:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. 
In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually a generalized version of the PCI group. In other words, in the PCI group
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Yunhong, Just trying to understand your use case: -- a VM can only work with cards from vendor V1 -- a VM can work with cards from both vendor V1 and V2 So stats in the two flavors will overlap in the PCI flavor solution. I'm just trying to say that this is something that needs to be properly addressed. Just for the sake of discussion, another solution to meeting the above requirement is to be able to say in the nova flavor's extra-spec encrypt_card = card from vendor V1 OR encrypt_card = card from vendor V2. In other words, this can be solved in the nova flavor, rather than introducing a new flavor. Thanks, Robert On 1/17/14 7:03 PM, yunhong jiang yunhong.ji...@linux.intel.com wrote: On Fri, 2014-01-17 at 22:30 +, Robert Li (baoli) wrote: Yunhong, I'm hoping that these comments can be directly addressed: a practical deployment scenario that requires arbitrary attributes. I'm just strongly against supporting only one attribute (your PCI group) for scheduling and management; that's really TOO limited. A simple scenario is, I have 3 encryption cards: Card 1 (vendor_id is V1, device_id = 0xa), Card 2 (vendor_id is V1, device_id = 0xb), Card 3 (vendor_id is V2, device_id = 0xb). I have two images. One image only supports Card 1 and another image supports Card 1/3 (or any other combination of the 3 card types). I don't think only one attribute will meet such a requirement. As to arbitrary attributes or a limited list of attributes, my opinion is, as there are so many types of PCI devices and so many potential PCI device usages, supporting arbitrary attributes will make our effort more flexible, if we can push the implementation into the tree. detailed design on the following (that also takes into account the introduction of predefined attributes): * PCI stats report since the scheduler is stats based I don't think there is much difference from the current implementation. * the scheduler in support of PCI flavors with arbitrary attributes and potential overlapping. As Ian said, we need to make sure the pci_stats and the PCI flavor have the same set of attributes, so I don't think there is much difference from the current implementation. networking requirements to support multiple provider nets/physical nets Can't the extra info resolve this issue? Can you elaborate on the issue? Thanks --jyh I guess that the above will become clear as the discussion goes on. And we also need to define the deliverables. Thanks, Robert ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
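A sketch of what Robert's extra-spec alternative might look like - the key name and value syntax below are invented, no such extra spec exists today, and Nova would need a scheduler filter to interpret it:

    # hypothetical: attach the requirement to an ordinary nova flavor
    nova flavor-key m1.crypto set "pci_passthrough:encrypt_card"="vendor_id:V1 OR vendor_id:V2"

The nova flavor-key ... set command itself is standard (extra specs are free-form key/value pairs); only the key/value convention and the code that would honour it are hypothetical.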
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, Having post PCI meeting discussion with Ian based on his proposal https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#, I am not sure that the case that quite usable for SR-IOV based networking is covered well by this proposal. The understanding I got is that VM can land on the Host that will lack suitable PCI resource. Can you please provide an example for the required cloud admin PCI related configurations on nova-compute and controller node with regards to the following simplified scenario: -- There are 2 provider networks (phy1, phy2), each one has associated range on vlan-ids -- Each compute node has 2 vendor adapters with SR-IOV enabled feature, exposing xx Virtual Functions. -- Every VM vnic on virtual network on provider network phy1 or phy2 should be pci pass-through vnic. Thanks a lot, Irena -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Saturday, January 18, 2014 12:33 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, I'm hoping that these comments can be directly addressed: a practical deployment scenario that requires arbitrary attributes. detailed design on the following (that also take into account the introduction of predefined attributes): * PCI stats report since the scheduler is stats based * the scheduler in support of PCI flavors with arbitrary attributes and potential overlapping. networking requirements to support multiple provider nets/physical nets I guess that the above will become clear as the discussion goes on. And we also need to define the deliveries Thanks, Robert On 1/17/14 2:02 PM, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, thanks for your long reply. Personally I'd prefer option 2/3 as it keep Nova the only entity for PCI management. Glad you are ok with Ian's proposal and we have solution to resolve the libvirt network scenario in that framework. Thanks --jyh -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, January 17, 2014 7:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. 
In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 20 January 2014 09:28, Irena Berezovsky ire...@mellanox.com wrote: Hi, Having had a post-PCI-meeting discussion with Ian based on his proposal https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1# , I am not sure that the case that is quite usable for SR-IOV based networking is covered well by this proposal. The understanding I got is that a VM can land on a Host that lacks a suitable PCI resource. The issue we have is if we have multiple underlying networks in the system and only some Neutron networks are trunked on the network that the PCI device is attached to. This can specifically happen in the case of provider versus trunk networks, though it's very dependent on the setup of your system. The issue is that, in the design we have, Neutron at present has no input into scheduling, and also that all devices in a flavor are precisely equivalent. So if I say 'I want a 10G card attached to network X' I will get one of the cards in the 10G flavor with no regard as to whether it can actually attach to network X. I can see two options here: 1. What I'd do right now is I would make it so that a VM that is given an unsuitable network card fails to run in nova-compute when Neutron discovers it can't attach the PCI device to the network. This will get us a lot of use cases and a Neutron driver without solving the problem elegantly. You'd need to choose e.g. a provider or tenant network flavor, mindful of the network you're connecting to, so that Neutron can actually succeed, which is more visibility into the workings of Neutron than the user really ought to need. 2. When Nova checks that all the networks exist - which, conveniently, is in nova-api - it also gets attributes from the networks that can be used by the scheduler to choose a device. So the scheduler chooses from a flavor *and*, within that flavor, from a subset of those devices with appropriate connectivity. If we do this then the Neutron connection code doesn't change - it should still fail if the connection can't be made - but it becomes an internal error, since it's now an issue of consistency of setup. To do this, I think we would tell Neutron 'PCI extra-info X should be set to Y for this provider network and Z for tenant networks' - the precise implementation would be somewhat up to the driver - and then add the additional check in the scheduler. The scheduling attributes list would have to include that attribute. Can you please provide an example for the required cloud admin PCI related configurations on nova-compute and controller node with regards to the following simplified scenario: -- There are 2 provider networks (phy1, phy2), each one has an associated range of vlan-ids -- Each compute node has 2 vendor adapters with the SR-IOV feature enabled, exposing xx Virtual Functions. -- Every VM vnic on a virtual network on provider network phy1 or phy2 should be a pci pass-through vnic. So, we would configure Neutron to check the 'e.physical_network' attribute on connection and to return it as a requirement on networks. Any PCI device on provider network 'phy1' would be tagged e.physical_network = 'phy1'. When returning the network, an extra attribute would be supplied (perhaps something like 'pci_requirements = { e.physical_network = 'phy1' }'). And nova-api would know that, in the case of macvtap and PCI directmap, it would need to pass this additional information to the scheduler which would need to make use of it in finding a device, over and above the flavor requirements. 
Neutron, when mapping a PCI port, would similarly work out from the Neutron network the trunk it needs to connect to, and would reject any mapping that didn't conform. If it did, it would work out how to encapsulate the traffic from the PCI device and set that up on the PF of the port. I'm not saying this is the only or best solution, but it does have the advantage that it keeps all of the networking behaviour in Neutron - hopefully Nova remains almost completely ignorant of what the network setup is, since the only thing we have to do is pass on PCI requirements, and we already have a convenient call flow we can use that's there for the network existence check. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
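As a purely illustrative rendering of the above for Irena's two-network scenario - the option names and syntax here follow the spirit of the proposal rather than any agreed format: on each compute node, the VFs of the adapter wired to each provider network would be whitelisted with an extra-info tag naming that network, and the tag would be included in the stats attributes, e.g.

    pci_information = { 'vendor_id': '8086', 'product_id': '10ed', 'address': '0000:06:*' }, { 'e.physical_network': 'phy1' }
    pci_information = { 'vendor_id': '15b3', 'product_id': '1004', 'address': '0000:07:*' }, { 'e.physical_network': 'phy2' }
    pci_flavor_attrs = vendor_id, product_id, e.physical_network

while the controller's Neutron/ML2 configuration maps the provider networks as usual, e.g.

    network_vlan_ranges = phy1:100:199,phy2:200:299

A Neutron network created with provider:physical_network=phy1 would then report e.physical_network = 'phy1' as its PCI requirement, and the scheduler would only consider stats pools carrying that tag.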
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Robert, Yonhong, Although network XML solution (option 1) is very elegant, it has one major disadvantage. As Robert mentioned, the disadvantage of the network XML is the inability to know what SR-IOV PCI device was actually allocated. When neutron is responsible to set networking configuration, manage admin status, set security groups, it should be able to identify the SR-IOV PCI device to apply configuration. Within current libvirt Network XML implementation, it does not seem possible. Between option (2) and (3), I do not have any preference, it should be as simple as possible. Option (3) that I raised can be achieved by renaming the network interface of Virtual Function via 'ip link set name'. Interface logical name can be based on neutron port UUID. This will allow neutron to discover devices, if backend plugin requires it. Once VM is migrating, suitable Virtual Function on the target node should be allocated, and then its corresponding network interface should be renamed to same logical name. This can be done without system rebooting. Still need to check how the Virtual Function corresponding network interface can be returned to its original name once is not used anymore as VM vNIC. Regards, Irena -Original Message- From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Friday, January 17, 2014 9:06 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, thanks for your long reply. Personally I'd prefer option 2/3 as it keep Nova the only entity for PCI management. Glad you are ok with Ian's proposal and we have solution to resolve the libvirt network scenario in that framework. Thanks --jyh -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, January 17, 2014 7:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. 
Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually a generalized version of the PCI group. In other words, in the PCI group proposal, we have one predefined attribute called PCI group, and everything else works on top of that. In the PCI flavor proposal, attribute is arbitrary. So certainly we can define a particular attribute for networking, which let's temporarily
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually a generalized version of the PCI group. In other words, in the PCI group proposal, we have one predefined attribute called PCI group, and everything else works on top of that. In the PCI flavor proposal, attribute is arbitrary. So certainly we can define a particular attribute for networking, which let's temporarily call sriov_group. But I can see with this idea of predefined attributes, more of them will be required by different types of devices in the future. I'm sure it will keep us busy although I'm not sure it's in a good way. I was expecting you or someone else can provide a practical deployment scenario that would justify the flexibilities and the complexities. Although I'd prefer to keep it simple and generalize it later once a particular requirement is clearly identified, I'm fine to go with it if that's most of the folks want to do. 
--Robert On 1/16/14 8:36 PM, yunhong jiang yunhong.ji...@linux.intel.com wrote: On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote: To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Robert, thanks for your long reply. Personally I'd prefer option 2/3 as it keep Nova the only entity for PCI management. Glad you are ok with Ian's proposal and we have solution to resolve the libvirt network scenario in that framework. Thanks --jyh -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, January 17, 2014 7:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually a generalized version of the PCI group. In other words, in the PCI group proposal, we have one predefined attribute called PCI group, and everything else works on top of that. In the PCI flavor proposal, attribute is arbitrary. So certainly we can define a particular attribute for networking, which let's temporarily call sriov_group. But I can see with this idea of predefined attributes, more of them will be required by different types of devices in the future. I'm sure it will keep us busy although I'm not sure it's in a good way. 
I was expecting you or someone else can provide a practical deployment scenario that would justify the flexibilities and the complexities. Although I'd prefer to keep it simple and generalize it later once a particular requirement is clearly identified, I'm fine to go with it if that's most of the folks want to do. --Robert On 1/16/14 8:36 PM, yunhong jiang yunhong.ji...@linux.intel.com wrote: On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote: To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Yunhong, I'm hoping that these comments can be directly addressed: a practical deployment scenario that requires arbitrary attributes. detailed design on the following (that also take into account the introduction of predefined attributes): * PCI stats report since the scheduler is stats based * the scheduler in support of PCI flavors with arbitrary attributes and potential overlapping. networking requirements to support multiple provider nets/physical nets I guess that the above will become clear as the discussion goes on. And we also need to define the deliveries Thanks, Robert On 1/17/14 2:02 PM, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, thanks for your long reply. Personally I'd prefer option 2/3 as it keep Nova the only entity for PCI management. Glad you are ok with Ian's proposal and we have solution to resolve the libvirt network scenario in that framework. Thanks --jyh -Original Message- From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, January 17, 2014 7:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Yunhong, Thank you for bringing that up on the live migration support. In addition to the two solutions you mentioned, Irena has a different solution. Let me put all the them here again: 1. network xml/group based solution. In this solution, each host that supports a provider net/physical net can define a SRIOV group (it's hard to avoid the term as you can see from the suggestion you made based on the PCI flavor proposal). For each SRIOV group supported on a compute node, A network XML will be created the first time the nova compute service is running on that node. * nova will conduct scheduling, but not PCI device allocation * it's a simple and clean solution, documented in libvirt as the way to support live migration with SRIOV. In addition, a network xml is nicely mapped into a provider net. 2. network xml per PCI device based solution This is the solution you brought up in this email, and Ian mentioned this to me as well. In this solution, a network xml is created when A VM is created. the network xml needs to be removed once the VM is removed. This hasn't been tried out as far as I know. 3. interface xml/interface rename based solution Irena brought this up. In this solution, the ethernet interface name corresponding to the PCI device attached to the VM needs to be renamed. One way to do so without requiring system reboot is to change the udev rule's file for interface renaming, followed by a udev reload. Now, with the first solution, Nova doesn't seem to have control over or visibility of the PCI device allocated for the VM before the VM is launched. This needs to be confirmed with the libvirt support and see if such capability can be provided. This may be a potential drawback if a neutron plugin requires detailed PCI device information for operation. Irena may provide more insight into this. Ideally, neutron shouldn't need this information because the device configuration can be done by libvirt invoking the PCI device driver. The other two solutions are similar. For example, you can view the second solution as one way to rename an interface, or camouflage an interface under a network name. They all require additional works before the VM is created and after the VM is removed. I also agree with you that we should take a look at XenAPI on this. 
With regard to your suggestion on how to implement the first solution with some predefined group attribute, I think it definitely can be done. As I have pointed it out earlier, the PCI flavor proposal is actually a generalized version of the PCI group. In other words, in the PCI group proposal, we have one predefined attribute called PCI group, and everything else works on top of that. In the PCI flavor proposal, attribute is arbitrary. So certainly we can define a particular attribute for networking, which let's temporarily call sriov_group. But I can see with this idea of predefined attributes, more of them will be required by different types of devices in the future. I'm sure it will keep us busy although I'm not sure it's in a good way. I was expecting you or someone else can provide a practical deployment scenario that would justify the flexibilities and the complexities. Although I'd prefer to keep it simple and generalize it later once a particular requirement is clearly identified, I'm fine to go with it if that's most of the folks want to do. --Robert On 1/16/14 8:36 PM, yunhong jiang yunhong.ji...@linux.intel.com wrote: On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote: To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On Fri, 2014-01-17 at 22:30 +, Robert Li (baoli) wrote: Yunhong, I'm hoping that these comments can be directly addressed: a practical deployment scenario that requires arbitrary attributes. I'm just strongly against supporting only one attribute (your PCI group) for scheduling and management; that's really TOO limited. A simple scenario is, I have 3 encryption cards: Card 1 (vendor_id is V1, device_id = 0xa), Card 2 (vendor_id is V1, device_id = 0xb), Card 3 (vendor_id is V2, device_id = 0xb). I have two images. One image only supports Card 1 and another image supports Card 1/3 (or any other combination of the 3 card types). I don't think only one attribute will meet such a requirement. As to arbitrary attributes or a limited list of attributes, my opinion is, as there are so many types of PCI devices and so many potential PCI device usages, supporting arbitrary attributes will make our effort more flexible, if we can push the implementation into the tree. detailed design on the following (that also takes into account the introduction of predefined attributes): * PCI stats report since the scheduler is stats based I don't think there is much difference from the current implementation. * the scheduler in support of PCI flavors with arbitrary attributes and potential overlapping. As Ian said, we need to make sure the pci_stats and the PCI flavor have the same set of attributes, so I don't think there is much difference from the current implementation. networking requirements to support multiple provider nets/physical nets Can't the extra info resolve this issue? Can you elaborate on the issue? Thanks --jyh I guess that the above will become clear as the discussion goes on. And we also need to define the deliverables. Thanks, Robert ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
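Spelling the scenario out under the flavor syntax being discussed (illustrative only):

    flavor 'card1-only' : vendor_id = V1, device_id = 0xa
    flavor 'card1-or-3' : vendor_id = V1 | V2, device_id = 0xa | 0xb

The image that only works with Card 1 would request the first flavor, the other image the second, and the two flavors overlap on Card 1 - the overlap Robert is concerned about. Note that, as written, the second expression would also admit Card 2 (V1, 0xb); expressing exactly 'Card 1 or Card 3' runs into the same need for a top-level combination of sub-expressions raised elsewhere in this thread.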
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 2014年01月16日 08:28, Ian Wells wrote: To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under consideration where a device can be mapped using more than one flavor, so we have to discard the use case or rethink the implementation. [yongli he:] But, with the flavor we defined, the group could be a tag for this purpose, and all of Robert's design would still work, so it's OK, right? [Ian's original message continues:] There's a more complex solution - I think - where we create a temporary network for each macvtap interface a machine's going to use, with a name based on the instance UUID and port number, and containing the device to map. Before starting the migration we would create a replacement network containing only the new device on the target host; migration would find the network from the name in the libvirt.xml, and the content of that network would behave identically. We'd be creating libvirt networks on the fly and a lot more of them, and we'd need decent cleanup code too ('when freeing a PCI device, delete any network it's a member of'), so it all becomes a lot more hairy. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
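For reference, the kind of libvirt definition this relies on is a 'passthrough'-mode network acting as a pool of equivalent macvtap-capable interfaces (the network and interface names below are made up):

    <network>
      <name>sriov-group-phy1</name>
      <forward mode='passthrough'>
        <interface dev='eth8'/>
        <interface dev='eth9'/>
      </forward>
    </network>

The guest's libvirt.xml then points at the pool rather than at a fixed device:

    <interface type='network'>
      <source network='sriov-group-phy1'/>
    </interface>

so after migration libvirt picks any free interface from the identically named pool on the target host - which is also why Nova does not get to choose the exact device in this solution.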
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, Thank you for putting the ongoing discussed specification in writing. I have added a few comments on the Google doc [1]. As for live migration support, this can also be done without libvirt network usage. Not very elegant, but working: rename the interface of the PCI device to some logical name, let's say based on the neutron port UUID, and put it into the interface XML, i.e.: If the PCI device network interface name is eth8 and the neutron port UUID is 02bc4aec-b4f4-436f-b651-024 then rename it to something like 'eth02bc4aec-b4'. The interface XML will look like this:
...
<interface type='direct'>
  <mac address='fa:16:3e:46:d3:e8'/>
  <source dev='eth02bc4aec-b4' mode='passthrough'/>
  <target dev='macvtap0'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x03' function='0x0'/>
</interface>
...
[1] https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#heading=h.308b0wqn1zde BR, Irena From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Thursday, January 16, 2014 2:34 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under consideration where a device can be mapped using more than one flavor, so we have to discard the use case or rethink the implementation. There's a more complex solution - I think - where we create a temporary network for each macvtap interface a machine's going to use, with a name based on the instance UUID and port number, and containing the device to map. Before starting the migration we would create a replacement network containing only the new device on the target host; migration would find the network from the name in the libvirt.xml, and the content of that network would behave identically. 
We'd be creating libvirt networks on the fly and a lot more of them, and we'd need decent cleanup code too ('when freeing a PCI device, delete any network it's a member of'), so it all becomes a lot more hairy. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
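A minimal sketch of the per-port transient network Ian describes above, using python-libvirt. The helper name, the network naming scheme and the device names are illustrative assumptions, not part of any proposal; only libvirt.open(), networkCreateXML() and destroy() are real libvirt calls, and the forward mode would depend on the deployment (bridge vs passthrough macvtap):

import libvirt

def create_port_network(conn, instance_uuid, port_index, host_dev):
    # Hypothetical: a transient network named after the instance UUID and
    # port, so the migrating domain's <source network='...'/> resolves to
    # exactly one macvtap-capable device on the target host.
    name = "pci-%s-%d" % (instance_uuid, port_index)
    xml = """
    <network>
      <name>%s</name>
      <forward mode='passthrough'>
        <interface dev='%s'/>
      </forward>
    </network>""" % (name, host_dev)
    return conn.networkCreateXML(xml)   # transient: gone after destroy()

conn = libvirt.open("qemu:///system")
net = create_port_network(conn, "some-instance-uuid", 0, "eth8")
# ... perform the migration, then net.destroy() when the PCI device is freed.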
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 16 January 2014 09:07, yongli he yongli...@intel.com wrote: On 2014年01月16日 08:28, Ian Wells wrote: This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under but, with flavor we defined, the group could be a tag for this purpose, and all Robert's design still work, so it ok, right? Well, you could make a label up consisting of the values of the attributes in the group, but since a flavor can encompass multiple groups (for instance, I group by device and vendor and then I use two device types in my flavor) this still doesn't work. Irena's solution does, though. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
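A tiny illustration of Ian's point above, with invented pools and flavor: if the stats are grouped on vendor_id and product_id, a flavor that accepts two product IDs spans two pools, so no single group-derived label names all of the devices that flavor can use:

# Invented example: pci_stats pools keyed on (vendor_id, product_id).
pools = [
    {"vendor_id": "1137", "product_id": "0071", "count": 4},
    {"vendor_id": "1137", "product_id": "0072", "count": 2},
]

# A flavor allowing either product ID matches *both* pools...
flavor = {"vendor_id": "1137", "product_id": ["0071", "0072"]}

matching = [p for p in pools
            if p["vendor_id"] == flavor["vendor_id"]
            and p["product_id"] in flavor["product_id"]]
assert len(matching) == 2   # ...so one group label can't cover the flavor.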
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Irena, Thanks for pointing out an alternative to the network xml solution to live migration. I am still not clear about the solution. Some questions: 1. Where does the rename of the PCI device network interface name occur? 2. Can this rename be done for a VF? I think your example shows rename of a PF. Thanks, Sandhya From: Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Date: Thursday, January 16, 2014 4:43 AM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Ian, Thank you for putting in writing the ongoing discussed specification. I have added few comments on the Google doc [1]. As for live migration support, this can be done also without libvirt network usage. Not very elegant, but working: rename the interface of the PCI device to some logical name, let’s say based on neutron port UUID and put it into the interface XML, i.e.: If PCI device network interface name is eth8 and neutron port UUID is 02bc4aec-b4f4-436f-b651-024 then rename it to something like: eth02bc4aec-b4'. The interface XML will look like this: ... interface type='direct' mac address='fa:16:3e:46:d3:e8'/ source dev='eth02bc4aec-b4' mode='passthrough'/ target dev='macvtap0'/ model type='virtio'/ alias name='net0'/ address type='pci' domain='0x' bus='0x00' slot='0x03' function='0x0'/ /interface ... [1] https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#heading=h.308b0wqn1zde BR, Irena From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Thursday, January 16, 2014 2:34 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.commailto:ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. 
Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under consideration where a device can be mapped using more than one flavor, so we have to discard the use case or rethink the implementation. There's a more complex solution - I think - where we create a temporary network for each macvtap interface a machine's going to use, with a name based on the instance UUID and port number, and containing the device to map. Before starting the migration we would create a replacement network containing only the new device on the target host; migration would find the network from the name in the libvirt.xml, and the content of that network would behave identically. We'd be creating libvirt networks on the fly and a lot more of them, and we'd need decent cleanup code too ('when freeing a PCI device, delete any network it's a member of'), so it all becomes a lot more hairy. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote: To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under consideration where a device can be mapped using more than one flavor, so we have to discard the use case or rethink the implementation. There's a more complex solution - I think - where we create a temporary network for each macvtap interface a machine's going to use, with a name based on the instance UUID and port number, and containing the device to map. Before starting the migration we would create a replacement network containing only the new device on the target host; migration would find the network from the name in the libvirt.xml, and the content of that network would behave identically. We'd be creating libvirt networks on the fly and a lot more of them, and we'd need decent cleanup code too ('when freeing a PCI device, delete any network it's a member of'), so it all becomes a lot more hairy. Ian/Robert, below is my understanding of the method Robert wants to use - am I right? a) Define a libvirt network as in the "Using a macvtap direct connection" section at http://libvirt.org/formatnetwork.html . For example, something like the following:

<network>
  <name>group_name1</name>
  <forward mode='bridge'>
    <interface dev='eth20'/>
    <interface dev='eth21'/>
    <interface dev='eth22'/>
    <interface dev='eth23'/>
    <interface dev='eth24'/>
  </forward>
</network>

b) When assigning SRIOV NIC devices to an instance, as in the "Assignment from a pool of SRIOV VFs in a libvirt network definition" section in http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices , use the libvirt network definition group_name1. For example:

<interface type='network'>
  <source network='group_name1'/>
</interface>

If my understanding is correct, then there are some things still unclear to me: a) How will libvirt create the libvirt network (i.e. the libvirt network group_name1)?
Will it be created when the compute node boots up, or will it be created before instance creation? I suppose that per Robert's design it is created when the compute node comes up - am I right? b) If all the interfaces are used up by instances, what will happen? Say 4 interfaces are allocated to the group_name1 libvirt network and a user tries to migrate 6 instances using the 'group_name1' network - what happens then? And below are my comments: a) Yes, this is in fact different from the current nova PCI support philosophy. Currently we assume Nova owns the devices and manages the device assignment to each instance, whereas in this situation the libvirt network is in effect another (albeit very thin) layer of PCI device management! b) This also reminds me that other VMMs such as XenAPI may have special requirements, and we need input/confirmation from them as well. As to how to resolve the issue, I think there are several solutions: a) Create one libvirt network for each SRIOV NIC assigned to each instance dynamically, i.e. the libvirt network always contains only one interface; it could be created statically or dynamically. This solution in fact removes the
[openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Folks, In light of today's IRC meeting, and for the purpose of moving this forward, I'm fine to go with the following if that's what everyone wants to go with: https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit But with some concerns and reservations. --- I don't expect everyone to agree on this. But I think the proposal is much more complicated in terms of implementation and administration. --- I'd like to see a practical deployment scenario in which only PCI flavor can support, but PCI group can't, which I guess can justify the complexities. --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? --- I'd like to see a detailed (not vague) design on the following: * PCI stats report since the scheduler is stats based * the scheduler in support of PCI flavors with arbitrary attributes. --- I'd like to see how this can be mapped into SRIOV support: * the compute node needs to know the PCI flavor. A couple of reasons for this: - the neutron plugin may need this to associate with a particular subsystem (or physical network) - to support live migration, we need to use it to create network xml * We also need to be able to do auto discovery so that we can support live migration with SRIOV * use the PCI flavor in the —nic option and neutron commands --- Just want to point out that this PCI flavor doesn't seem to be the same PCI flavor that Join was talking about in one of his emails. I'd like to also point out that if you consider a PCI group as an attribute (in terms of the proposal), then the PCI group design is a special (or degenerated) case of the proposed design. The significant difference here is that with PCI group, its semantics is clear and well defined, and everything else works on top of it. An attribute is arbitrary and open for interpretation. In terms of getting things done ASAP, the PCI group is actually the way to go. I guess that we will take a phased approach to implement it so that we can get something done in Icehouse. However, I'd like to see that neutron requirements one way or the other can be satisfied in the first phase. Maybe we can continue the IRC tomorrow and talk about the above. Again, let's move on if that's really where we want to go. thanks, Robert ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
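For the stats-based view Robert asks to see detailed, here is a minimal sketch (my own illustration under assumed grouping keys, not the agreed design): each compute node reports pools keyed on the configured grouping attributes plus a free count, and a flavor request is satisfied by decrementing any pool whose attribute values match the flavor's expression:

# Illustrative only: a host's pci_stats report and a naive consumption step.
host_pci_stats = [
    {"vendor_id": "1137", "product_id": "0071", "group_name": "g1", "count": 6},
    {"vendor_id": "1234", "product_id": "0081", "group_name": "g2", "count": 2},
]

def consume(flavor_match, pools):
    """Decrement the first pool whose attributes satisfy the flavor."""
    for pool in pools:
        wanted_ok = all(pool.get(k) in (v if isinstance(v, list) else [v])
                        for k, v in flavor_match.items())
        if wanted_ok and pool["count"] > 0:
            pool["count"] -= 1
            return True
    return False

# e.g. a flavor defined as vendor_id=1137 and group_name=g1
consume({"vendor_id": "1137", "group_name": "g1"}, host_pci_stats)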
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
To clarify a couple of Robert's points, since we had a conversation earlier: On 15 January 2014 23:47, Robert Li (baoli) ba...@cisco.com wrote: --- do we agree that BDF address (or device id, whatever you call it), and node id shouldn't be used as attributes in defining a PCI flavor? Note that the current spec doesn't actually exclude it as an option. It's just an unwise thing to do. In theory, you could elect to define your flavors using the BDF attribute but determining 'the card in this slot is equivalent to all the other cards in the same slot in other machines' is probably not the best idea... We could lock it out as an option or we could just assume that administrators wouldn't be daft enough to try. * the compute node needs to know the PCI flavor. [...] - to support live migration, we need to use it to create network xml I didn't understand this at first and it took me a while to get what Robert meant here. This is based on Robert's current code for macvtap based live migration. The issue is that if you wish to migrate a VM and it's tied to a physical interface, you can't guarantee that the same physical interface is going to be used on the target machine, but at the same time you can't change the libvirt.xml as it comes over with the migrating machine. The answer is to define a network and refer out to it from libvirt.xml. In Robert's current code he's using the group name of the PCI devices to create a network containing the list of equivalent devices (those in the group) that can be macvtapped. Thus when the host migrates it will find another, equivalent, interface. This falls over in the use case under consideration where a device can be mapped using more than one flavor, so we have to discard the use case or rethink the implementation. There's a more complex solution - I think - where we create a temporary network for each macvtap interface a machine's going to use, with a name based on the instance UUID and port number, and containing the device to map. Before starting the migration we would create a replacement network containing only the new device on the target host; migration would find the network from the name in the libvirt.xml, and the content of that network would behave identically. We'd be creating libvirt networks on the fly and a lot more of them, and we'd need decent cleanup code too ('when freeing a PCI device, delete any network it's a member of'), so it all becomes a lot more hairy. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, After having a lot of discussions both on IRC and mailing list, I would like to suggest to define basic use cases for PCI pass-through network support with agreed list of limitations and assumptions and implement it. By doing this Proof of Concept we will be able to deliver basic PCI pass-through network support in Icehouse timeframe and understand better how to provide complete solution starting from tenant /admin API enhancement, enhancing nova-neutron communication and eventually provide neutron plugin supporting the PCI pass-through networking. We can try to split tasks between currently involved participants and bring up the basic case. Then we can enhance the implementation. Having more knowledge and experience with neutron parts, I would like to start working on neutron mechanism driver support. I have already started to arrange the following blueprint doc based on everyone's ideas: https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit#https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit For the basic PCI pass-through networking case we can assume the following: 1. Single provider network (PN1) 2. White list of available SRIOV PCI devices for allocation as NIC for neutron networks on provider network (PN1) is defined on each compute node 3. Support directly assigned SRIOV PCI pass-through device as vNIC. (This will limit the number of tests) 4. More If my suggestion seems reasonable to you, let's try to reach an agreement and split the work during our Monday IRC meeting. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Saturday, January 11, 2014 8:36 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Comments with prefix [yjiang5_2] , including the double confirm. I think we (you and me) is mostly on the same page, would you please give a summary, and then we can have community , including Irena/Robert, to check it. We need Cores to sponsor it. We should check with John to see if this is different with his mentor picture, and we may need a neutron core (I assume Cisco has a bunch of Neutron cores :) )to sponsor it? And, will anyone from Cisco can help on the implementation? After this long discussion, we are in half bottom of I release and I'm not sure if Yongli and I alone can finish them in I release. Thanks --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, January 10, 2014 6:34 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. 
So a stupid/naive solution in my head is, change it to VERY generic name, 'pci_devices_information', and change schema as an array of {'devices_property'=regex exp, 'group_name' = 'g1'} dictionary, and the device_property expression can be 'address ==xxx, vendor_id == xxx' (i.e. similar with current white list), and we can squeeze more into the pci_devices_information in future, like 'network_information' = xxx or Neutron specific information you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } [yjiang5_2] Double confirm that 'match' is whitelist, and info is 'extra info', right? Can the key be more meaningful, for example, s/match/pci_device_property, s/info/pci_device_info, or s/match/pci_devices/ etc. Also assume the class should be the class code in the configuration space, and be digital, am I right? Otherwise, it's not easy to get the 'Acme inc. discombobulator' information. All keys other than 'device_property' becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices,. Some implementation details, A)we can limit the acceptable keys, like we only support 'group_name', 'network_id', or we can accept any keys other than reserved (vendor_id
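To make the schema under discussion concrete, one possible rendering of a single entry after the config is parsed - purely illustrative, with the key names following the match/info split suggested above and the attribute values invented:

# Illustrative parsed form of one 'pci_devices_information' entry:
# 'match' is the whitelisting expression, 'info' is operator-defined extra
# data carried with every device the expression matches.
pci_devices_information = [
    {
        "match": {"vendor_id": "1137", "product_id": "0071", "address": "*"},
        "info":  {"group_name": "g1", "network_id": "physnet1"},
    },
]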
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Irena, have a word with Bob (rkukura on IRC, East coast), he was talking about what would be needed already and should be able to help you. Conveniently he's also core. ;) -- Ian. On 12 January 2014 22:12, Irena Berezovsky ire...@mellanox.com wrote: Hi John, Thank you for taking an initiative and summing up the work that need to be done to provide PCI pass-through network support. The only item I think is missing is the neutron support for PCI pass-through. Currently we have Mellanox Plugin that supports PCI pass-through assuming Mellanox Adapter card embedded switch technology. But in order to have fully integrated PCI pass-through networking support for the use cases Robert listed on previous mail, the generic neutron PCI pass-through support is required. This can be enhanced with vendor specific task that may differ (Mellanox Embedded switch vs Cisco 802.1BR), but there is still common part of being PCI aware mechanism driver. I have already started with definition for this part: https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit# I also plan to start coding soon. Depends on how it goes, I can take also nova parts that integrate with neutron APIs from item 3. Regards, Irena -Original Message- From: John Garbutt [mailto:j...@johngarbutt.com] Sent: Friday, January 10, 2014 4:34 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Apologies for this top post, I just want to move this discussion towards action. I am traveling next week so it is unlikely that I can make the meetings. Sorry. Can we please agree on some concrete actions, and who will do the coding? This also means raising new blueprints for each item of work. I am happy to review and eventually approve those blueprints, if you email me directly. Ideas are taken from what we started to agree on, mostly written up here: https://wiki.openstack.org/wiki/Meetings/Passthrough#Definitions What doesn't need doing... We have PCI whitelist and PCI alias at the moment, let keep those names the same for now. I personally prefer PCI-flavor, rather than PCI-alias, but lets discuss any rename separately. We seemed happy with the current system (roughly) around GPU passthrough: nova flavor-key three_GPU_attached_30GB set pci_passthrough:alias= large_GPU:1,small_GPU:2 nova boot --image some_image --flavor three_GPU_attached_30GB some_name Again, we seemed happy with the current PCI whitelist. Sure, we could optimise the scheduling, but again, please keep that a separate discussion. Something in the scheduler needs to know how many of each PCI alias are available on each host. How that information gets there can be change at a later date. PCI alias is in config, but its probably better defined using host aggregates, or some custom API. But lets leave that for now, and discuss it separately. If the need arrises, we can migrate away from the config. What does need doing... == 1) API CLI changes for nic-type, and associated tempest tests * Add a user visible nic-type so users can express on of several network types. * We need a default nic-type, for when the user doesn't specify one (might default to SRIOV in some cases) * We can easily test the case where the default is virtual and the user expresses a preference for virtual * Above is much better than not testing it at all. 
nova boot --flavor m1.large --image image_id --nic net-id=net-id-1 --nic net-id=net-id-2,nic-type=fast --nic net-id=net-id-3,nic-type=fast vm-name or neutron port-create --fixed-ip subnet_id=subnet-id,ip_address=192.168.57.101 --nic-type=slow | fast | foobar net-id nova boot --flavor m1.large --image image_id --nic port-id=port-id Where nic-type is just an extra bit metadata string that is passed to nova and the VIF driver. 2) Expand PCI alias information We need extensions to PCI alias so we can group SRIOV devices better. I still think we are yet to agree on a format, but I would suggest this as a starting point: { name:GPU_fast, devices:[ {vendor_id:1137,product_id:0071, address:*, attach-type:direct}, {vendor_id:1137,product_id:0072, address:*, attach-type:direct} ], sriov_info: {} } { name:NIC_fast, devices:[ {vendor_id:1137,product_id:0071, address:0:[1-50]:2:*, attach-type:macvtap} {vendor_id:1234,product_id:0081, address:*, attach-type:direct} ], sriov_info: { nic_type:fast, network_ids: [net-id-1, net-id-2] } } { name:NIC_slower, devices:[ {vendor_id:1137,product_id:0071, address:*, attach-type:direct} {vendor_id:1234,product_id:0081, address:*, attach-type:direct} ], sriov_info: { nic_type:fast, network_ids: [*] # this means could attach to any network } } The idea being the VIF driver gets passed this info, when
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, It's great news. Thank you for bringing Bob's attention to this effort. I'll look for Bob on IRC to get the details. And of course, core support raises our chances to make PCI pass-through networking into icehouse. BR, Irena From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Monday, January 13, 2014 2:02 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Irena, have a word with Bob (rkukura on IRC, East coast), he was talking about what would be needed already and should be able to help you. Conveniently he's also core. ;) -- Ian. On 12 January 2014 22:12, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi John, Thank you for taking an initiative and summing up the work that need to be done to provide PCI pass-through network support. The only item I think is missing is the neutron support for PCI pass-through. Currently we have Mellanox Plugin that supports PCI pass-through assuming Mellanox Adapter card embedded switch technology. But in order to have fully integrated PCI pass-through networking support for the use cases Robert listed on previous mail, the generic neutron PCI pass-through support is required. This can be enhanced with vendor specific task that may differ (Mellanox Embedded switch vs Cisco 802.1BR), but there is still common part of being PCI aware mechanism driver. I have already started with definition for this part: https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit#https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit I also plan to start coding soon. Depends on how it goes, I can take also nova parts that integrate with neutron APIs from item 3. Regards, Irena -Original Message- From: John Garbutt [mailto:j...@johngarbutt.commailto:j...@johngarbutt.com] Sent: Friday, January 10, 2014 4:34 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Apologies for this top post, I just want to move this discussion towards action. I am traveling next week so it is unlikely that I can make the meetings. Sorry. Can we please agree on some concrete actions, and who will do the coding? This also means raising new blueprints for each item of work. I am happy to review and eventually approve those blueprints, if you email me directly. Ideas are taken from what we started to agree on, mostly written up here: https://wiki.openstack.org/wiki/Meetings/Passthrough#Definitions What doesn't need doing... We have PCI whitelist and PCI alias at the moment, let keep those names the same for now. I personally prefer PCI-flavor, rather than PCI-alias, but lets discuss any rename separately. We seemed happy with the current system (roughly) around GPU passthrough: nova flavor-key three_GPU_attached_30GB set pci_passthrough:alias= large_GPU:1,small_GPU:2 nova boot --image some_image --flavor three_GPU_attached_30GB some_name Again, we seemed happy with the current PCI whitelist. Sure, we could optimise the scheduling, but again, please keep that a separate discussion. Something in the scheduler needs to know how many of each PCI alias are available on each host. How that information gets there can be change at a later date. PCI alias is in config, but its probably better defined using host aggregates, or some custom API. But lets leave that for now, and discuss it separately. 
If the need arrises, we can migrate away from the config. What does need doing... == 1) API CLI changes for nic-type, and associated tempest tests * Add a user visible nic-type so users can express on of several network types. * We need a default nic-type, for when the user doesn't specify one (might default to SRIOV in some cases) * We can easily test the case where the default is virtual and the user expresses a preference for virtual * Above is much better than not testing it at all. nova boot --flavor m1.large --image image_id --nic net-id=net-id-1 --nic net-id=net-id-2,nic-type=fast --nic net-id=net-id-3,nic-type=fast vm-name or neutron port-create --fixed-ip subnet_id=subnet-id,ip_address=192.168.57.101 --nic-type=slow | fast | foobar net-id nova boot --flavor m1.large --image image_id --nic port-id=port-id Where nic-type is just an extra bit metadata string that is passed to nova and the VIF driver. 2) Expand PCI alias information We need extensions to PCI alias so we can group SRIOV devices better. I still think we are yet to agree on a format, but I would suggest this as a starting point: { name:GPU_fast, devices:[ {vendor_id:1137,product_id:0071, address:*, attach-type:direct}, {vendor_id:1137,product_id:0072, address:*, attach-type:direct} ], sriov_info: {} } { name:NIC_fast, devices:[ {vendor_id:1137,product_id:0071
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
As I have responded in the other email, and If I understand PCI flavor correctly, then the issue that we need to deal with is the overlapping issue. A simplest case of this overlapping is that you can define a flavor F1 as [vendor_id='v', product_id='p'], and a flavor F2 as [vendor_id = 'v'] . Let's assume that only the admin can define the flavors. It's not hard to see that a device can belong to the two different flavors in the same time. This introduces an issue in the scheduler. Suppose the scheduler (counts or stats based) maintains counts based on flavors (or the keys corresponding to the flavors). To request a device with the flavor F1, counts in F2 needs to be subtracted by one as well. There may be several ways to achieve that. But regardless, it introduces tremendous overhead in terms of system processing and administrative costs. What are the use cases for that? How practical are those use cases? thanks, Robert On 1/10/14 9:34 PM, Ian Wells ijw.ubu...@cack.org.ukmailto:ijw.ubu...@cack.org.uk wrote: OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. So a stupid/naive solution in my head is, change it to VERY generic name, ‘pci_devices_information’, and change schema as an array of {‘devices_property’=regex exp, ‘group_name’ = ‘g1’} dictionary, and the device_property expression can be ‘address ==xxx, vendor_id == xxx’ (i.e. similar with current white list), and we can squeeze more into the “pci_devices_information” in future, like ‘network_information’ = xxx or “Neutron specific information” you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } All keys other than ‘device_property’ becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices,. Some implementation details, A)we can limit the acceptable keys, like we only support ‘group_name’, ‘network_id’, or we can accept any keys other than reserved (vendor_id, device_id etc) one. Not sure we have a good list of reserved keys at the moment, and with two dicts it isn't really necessary, I guess. I would say that we have one match parser which looks something like this: # does this PCI device match the expression given? def match(expression, pci_details, extra_specs): for (k, v) in expression: if k.starts_with('e.'): mv = extra_specs.get(k[2:]) else: mv = pci_details.get(k[2:]) if not match(m, mv): return False return True Usable in this matching (where 'e.' just won't work) and also for flavor assignment (where e. will indeed match the extra values). B) if a device match ‘device_property’ in several entries, raise exception, or use the first one. Use the first one, I think. 
It's easier, and potentially more useful. [yjiang5_1] Another thing need discussed is, as you pointed out, “we would need to add a config param on the control host to decide which flags to group on when doing the stats”. I agree with the design, but some details need decided. This is a patch that can come at any point after we do the above stuff (which we need for Neutron), clearly. Where should it defined. If we a) define it in both control node and compute node, then it should be static defined (just change pool_keys in /opt/stack/nova/nova/pci/pci_stats.py to a configuration parameter) . Or b) define only in control node, then I assume the control node should be the scheduler node, and the scheduler manager need save such information, present a API to fetch such information and the compute node need fetch it on every update_available_resource() periodic task. I’d prefer to take a) option in first step. Your idea? I think it has to be (a), which is a shame. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
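For reference, a runnable rendering of the matcher sketched in the message above, assuming (as in the sketch) that an 'e.' prefix selects the extra info, any other key selects a raw device property, and the value test is plain equality where the original left the comparison open:

def match(expression, pci_details, extra_specs):
    """Does this PCI device match the expression given?"""
    for k, wanted in expression.items():
        if k.startswith('e.'):
            actual = extra_specs.get(k[2:])
        else:
            actual = pci_details.get(k)
        if actual != wanted:      # the original leaves the comparison open
            return False
    return True

# e.g. flavor assignment can use 'e.' keys, device whitelisting cannot:
match({'vendor_id': '1137', 'e.group': 'g1'},
      {'vendor_id': '1137', 'product_id': '0071'},
      {'group': 'g1'})            # -> True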
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, Robert, scheduler keep count based on pci_stats instead of the pci flavor. As stated by Ian at https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg13455.html already, the flavor will only use the tags used by pci_stats. Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, January 13, 2014 8:22 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support As I have responded in the other email, and If I understand PCI flavor correctly, then the issue that we need to deal with is the overlapping issue. A simplest case of this overlapping is that you can define a flavor F1 as [vendor_id='v', product_id='p'], and a flavor F2 as [vendor_id = 'v'] . Let's assume that only the admin can define the flavors. It's not hard to see that a device can belong to the two different flavors in the same time. This introduces an issue in the scheduler. Suppose the scheduler (counts or stats based) maintains counts based on flavors (or the keys corresponding to the flavors). To request a device with the flavor F1, counts in F2 needs to be subtracted by one as well. There may be several ways to achieve that. But regardless, it introduces tremendous overhead in terms of system processing and administrative costs. What are the use cases for that? How practical are those use cases? thanks, Robert On 1/10/14 9:34 PM, Ian Wells ijw.ubu...@cack.org.ukmailto:ijw.ubu...@cack.org.uk wrote: OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. So a stupid/naive solution in my head is, change it to VERY generic name, 'pci_devices_information', and change schema as an array of {'devices_property'=regex exp, 'group_name' = 'g1'} dictionary, and the device_property expression can be 'address ==xxx, vendor_id == xxx' (i.e. similar with current white list), and we can squeeze more into the pci_devices_information in future, like 'network_information' = xxx or Neutron specific information you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } All keys other than 'device_property' becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices,. Some implementation details, A)we can limit the acceptable keys, like we only support 'group_name', 'network_id', or we can accept any keys other than reserved (vendor_id, device_id etc) one. Not sure we have a good list of reserved keys at the moment, and with two dicts it isn't really necessary, I guess. 
I would say that we have one match parser which looks something like this: # does this PCI device match the expression given? def match(expression, pci_details, extra_specs): for (k, v) in expression: if k.starts_with('e.'): mv = extra_specs.get(k[2:]) else: mv = pci_details.get(k[2:]) if not match(m, mv): return False return True Usable in this matching (where 'e.' just won't work) and also for flavor assignment (where e. will indeed match the extra values). B) if a device match 'device_property' in several entries, raise exception, or use the first one. Use the first one, I think. It's easier, and potentially more useful. [yjiang5_1] Another thing need discussed is, as you pointed out, we would need to add a config param on the control host to decide which flags to group on when doing the stats. I agree with the design, but some details need decided. This is a patch that can come at any point after we do the above stuff (which we need for Neutron), clearly. Where should it defined. If we a) define it in both control node and compute node, then it should be static defined (just change pool_keys in /opt/stack/nova/nova/pci/pci_stats.py to a configuration parameter) . Or b) define only in control node, then I assume the control node should be the scheduler node, and the scheduler manager need save such information, present a API to fetch such information
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
It's worth noting that this makes the scheduling a computationally hard problem. The answer to that in this scheme is to reduce the number of inputs to trivialise the problem. It's going to be O(f(number of flavor types requested, number of pci_stats pools)) and if you group appropriately there shouldn't be an excessive number of pci_stats pools. I am not going to stand up and say this makes it achievable - and if it doesn't them I'm not sure that anything would make overlapping flavors achievable - but I think it gives us some hope. -- Ian. On 13 January 2014 19:27, Jiang, Yunhong yunhong.ji...@intel.com wrote: Hi, Robert, scheduler keep count based on pci_stats instead of the pci flavor. As stated by Ian at https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg13455.htmlalready, the flavor will only use the tags used by pci_stats. Thanks --jyh *From:* Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Monday, January 13, 2014 8:22 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support As I have responded in the other email, and If I understand PCI flavor correctly, then the issue that we need to deal with is the overlapping issue. A simplest case of this overlapping is that you can define a flavor F1 as [vendor_id='v', product_id='p'], and a flavor F2 as [vendor_id = 'v'] . Let's assume that only the admin can define the flavors. It's not hard to see that a device can belong to the two different flavors in the same time. This introduces an issue in the scheduler. Suppose the scheduler (counts or stats based) maintains counts based on flavors (or the keys corresponding to the flavors). To request a device with the flavor F1, counts in F2 needs to be subtracted by one as well. There may be several ways to achieve that. But regardless, it introduces tremendous overhead in terms of system processing and administrative costs. What are the use cases for that? How practical are those use cases? thanks, Robert On 1/10/14 9:34 PM, Ian Wells ijw.ubu...@cack.org.uk wrote: OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. So a stupid/naive solution in my head is, change it to VERY generic name, ‘pci_devices_information’, and change schema as an array of {‘devices_property’=regex exp, ‘group_name’ = ‘g1’} dictionary, and the device_property expression can be ‘address ==xxx, vendor_id == xxx’ (i.e. similar with current white list), and we can squeeze more into the “pci_devices_information” in future, like ‘network_information’ = xxx or “Neutron specific information” you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. 
discombobulator }, info = { group = we like teh groups, volume = 11 } } All keys other than ‘device_property’ becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices,. Some implementation details, A)we can limit the acceptable keys, like we only support ‘group_name’, ‘network_id’, or we can accept any keys other than reserved (vendor_id, device_id etc) one. Not sure we have a good list of reserved keys at the moment, and with two dicts it isn't really necessary, I guess. I would say that we have one match parser which looks something like this: # does this PCI device match the expression given? def match(expression, pci_details, extra_specs): for (k, v) in expression: if k.starts_with('e.'): mv = extra_specs.get(k[2:]) else: mv = pci_details.get(k[2:]) if not match(m, mv): return False return True Usable in this matching (where 'e.' just won't work) and also for flavor assignment (where e. will indeed match the extra values). B) if a device match ‘device_property’ in several entries, raise exception, or use the first one. Use the first one, I think. It's easier, and potentially more useful. [yjiang5_1] Another thing need discussed is, as you pointed out, “we would need to add a config param on the control
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
I'm not network engineer and always lost at 802.1Qbh/802.1BR specs :( So I'd wait for requirement from Neutron. A quick check seems my discussion with Ian meet the requirement already? Thanks --jyh From: Irena Berezovsky [mailto:ire...@mellanox.com] Sent: Monday, January 13, 2014 12:51 AM To: OpenStack Development Mailing List (not for usage questions) Cc: Jiang, Yunhong; He, Yongli; Robert Li (baoli) (ba...@cisco.com); Sandhya Dasu (sadasu) (sad...@cisco.com); ijw.ubu...@cack.org.uk; j...@johngarbutt.com Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi, After having a lot of discussions both on IRC and mailing list, I would like to suggest to define basic use cases for PCI pass-through network support with agreed list of limitations and assumptions and implement it. By doing this Proof of Concept we will be able to deliver basic PCI pass-through network support in Icehouse timeframe and understand better how to provide complete solution starting from tenant /admin API enhancement, enhancing nova-neutron communication and eventually provide neutron plugin supporting the PCI pass-through networking. We can try to split tasks between currently involved participants and bring up the basic case. Then we can enhance the implementation. Having more knowledge and experience with neutron parts, I would like to start working on neutron mechanism driver support. I have already started to arrange the following blueprint doc based on everyone's ideas: https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit#https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit For the basic PCI pass-through networking case we can assume the following: 1. Single provider network (PN1) 2. White list of available SRIOV PCI devices for allocation as NIC for neutron networks on provider network (PN1) is defined on each compute node 3. Support directly assigned SRIOV PCI pass-through device as vNIC. (This will limit the number of tests) 4. More If my suggestion seems reasonable to you, let's try to reach an agreement and split the work during our Monday IRC meeting. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Saturday, January 11, 2014 8:36 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Comments with prefix [yjiang5_2] , including the double confirm. I think we (you and me) is mostly on the same page, would you please give a summary, and then we can have community , including Irena/Robert, to check it. We need Cores to sponsor it. We should check with John to see if this is different with his mentor picture, and we may need a neutron core (I assume Cisco has a bunch of Neutron cores :) )to sponsor it? And, will anyone from Cisco can help on the implementation? After this long discussion, we are in half bottom of I release and I'm not sure if Yongli and I alone can finish them in I release. Thanks --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, January 10, 2014 6:34 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. 
Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. So a stupid/naive solution in my head is, change it to VERY generic name, 'pci_devices_information', and change schema as an array of {'devices_property'=regex exp, 'group_name' = 'g1'} dictionary, and the device_property expression can be 'address ==xxx, vendor_id == xxx' (i.e. similar with current white list), and we can squeeze more into the pci_devices_information in future, like 'network_information' = xxx or Neutron specific information you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } [yjiang5_2] Double confirm that 'match' is whitelist, and info is 'extra info', right? Can the key be more meaningful, for example, s/match
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, not sure if I get your question. Why should scheduler get the number of flavor types requested? The scheduler will only translate the PCI flavor to the pci property match requirement like it does now, (either vendor_id, device_id, or item in extra_info), then match the translated pci flavor, i.e. pci requests, to the pci stats. Thanks --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Monday, January 13, 2014 10:57 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support It's worth noting that this makes the scheduling a computationally hard problem. The answer to that in this scheme is to reduce the number of inputs to trivialise the problem. It's going to be O(f(number of flavor types requested, number of pci_stats pools)) and if you group appropriately there shouldn't be an excessive number of pci_stats pools. I am not going to stand up and say this makes it achievable - and if it doesn't them I'm not sure that anything would make overlapping flavors achievable - but I think it gives us some hope. -- Ian. On 13 January 2014 19:27, Jiang, Yunhong yunhong.ji...@intel.commailto:yunhong.ji...@intel.com wrote: Hi, Robert, scheduler keep count based on pci_stats instead of the pci flavor. As stated by Ian at https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg13455.html already, the flavor will only use the tags used by pci_stats. Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.commailto:ba...@cisco.com] Sent: Monday, January 13, 2014 8:22 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support As I have responded in the other email, and If I understand PCI flavor correctly, then the issue that we need to deal with is the overlapping issue. A simplest case of this overlapping is that you can define a flavor F1 as [vendor_id='v', product_id='p'], and a flavor F2 as [vendor_id = 'v'] . Let's assume that only the admin can define the flavors. It's not hard to see that a device can belong to the two different flavors in the same time. This introduces an issue in the scheduler. Suppose the scheduler (counts or stats based) maintains counts based on flavors (or the keys corresponding to the flavors). To request a device with the flavor F1, counts in F2 needs to be subtracted by one as well. There may be several ways to achieve that. But regardless, it introduces tremendous overhead in terms of system processing and administrative costs. What are the use cases for that? How practical are those use cases? thanks, Robert On 1/10/14 9:34 PM, Ian Wells ijw.ubu...@cack.org.ukmailto:ijw.ubu...@cack.org.uk wrote: OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. 
So a stupid/naive solution in my head is, change it to VERY generic name, 'pci_devices_information', and change schema as an array of {'devices_property'=regex exp, 'group_name' = 'g1'} dictionary, and the device_property expression can be 'address ==xxx, vendor_id == xxx' (i.e. similar with current white list), and we can squeeze more into the pci_devices_information in future, like 'network_information' = xxx or Neutron specific information you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } All keys other than 'device_property' becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices,. Some implementation details, A)we can limit the acceptable keys, like we only support 'group_name', 'network_id', or we can accept any keys other than reserved (vendor_id, device_id etc) one. Not sure we have a good list of reserved keys at the moment, and with two dicts it isn't really necessary, I guess. I would say that we have one match parser which looks something like this: # does this PCI device match the expression given? def match(expression, pci_details, extra_specs): for (k, v) in expression
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
If there are N flavor types there are N match expressions so I think it's pretty much equivalent in terms of complexity. It looks like some sort of packing problem to me, trying to fit N objects into M boxes, hence my statement that it's not going to be easy, but that's just a gut feeling - some of the matches can be vague, such as only the vendor ID or a vendor and two device types, so it's not as simple as one flavor matching one stats row. -- Ian. On 13 January 2014 21:00, Jiang, Yunhong yunhong.ji...@intel.com wrote: Ian, not sure if I get your question. Why should scheduler get the number of flavor types requested? The scheduler will only translate the PCI flavor to the pci property match requirement like it does now, (either vendor_id, device_id, or item in extra_info), then match the translated pci flavor, i.e. pci requests, to the pci stats. Thanks --jyh *From:* Ian Wells [mailto:ijw.ubu...@cack.org.uk] *Sent:* Monday, January 13, 2014 10:57 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support It's worth noting that this makes the scheduling a computationally hard problem. The answer to that in this scheme is to reduce the number of inputs to trivialise the problem. It's going to be O(f(number of flavor types requested, number of pci_stats pools)) and if you group appropriately there shouldn't be an excessive number of pci_stats pools. I am not going to stand up and say this makes it achievable - and if it doesn't them I'm not sure that anything would make overlapping flavors achievable - but I think it gives us some hope. -- Ian. On 13 January 2014 19:27, Jiang, Yunhong yunhong.ji...@intel.com wrote: Hi, Robert, scheduler keep count based on pci_stats instead of the pci flavor. As stated by Ian at https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg13455.htmlalready, the flavor will only use the tags used by pci_stats. Thanks --jyh *From:* Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Monday, January 13, 2014 8:22 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support As I have responded in the other email, and If I understand PCI flavor correctly, then the issue that we need to deal with is the overlapping issue. A simplest case of this overlapping is that you can define a flavor F1 as [vendor_id='v', product_id='p'], and a flavor F2 as [vendor_id = 'v'] . Let's assume that only the admin can define the flavors. It's not hard to see that a device can belong to the two different flavors in the same time. This introduces an issue in the scheduler. Suppose the scheduler (counts or stats based) maintains counts based on flavors (or the keys corresponding to the flavors). To request a device with the flavor F1, counts in F2 needs to be subtracted by one as well. There may be several ways to achieve that. But regardless, it introduces tremendous overhead in terms of system processing and administrative costs. What are the use cases for that? How practical are those use cases? thanks, Robert On 1/10/14 9:34 PM, Ian Wells ijw.ubu...@cack.org.uk wrote: OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. 
Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason that not easy to find a flexible/distinguishable change to pci_whitelist is because it combined two things. So a stupid/naive solution in my head is, change it to VERY generic name, ‘pci_devices_information’, and change schema as an array of {‘devices_property’=regex exp, ‘group_name’ = ‘g1’} dictionary, and the device_property expression can be ‘address ==xxx, vendor_id == xxx’ (i.e. similar with current white list), and we can squeeze more into the “pci_devices_information” in future, like ‘network_information’ = xxx or “Neutron specific information” you required in previous mail. We're getting to the stage that an expression parser would be useful, annoyingly, but if we are going to try and squeeze it into JSON can I suggest: { match = { class = Acme inc. discombobulator }, info = { group = we like teh groups, volume = 11 } } All keys other than ‘device_property’ becomes extra information, i.e. software defined property. These extra information will be carried with the PCI devices
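To put the complexity point above in concrete terms, the per-host check amounts to something like this (purely illustrative, not from any patch; a greedy pass like this is only sufficient when flavors don't overlap - overlapping flavors are exactly the packing problem mentioned above):

    # purely illustrative, not from any patch: the per-host check when each
    # requested (flavor match, count) pair is filtered against the host's
    # pci_stats pools; a greedy pass like this is only enough when flavors
    # don't overlap, which is where the packing problem comes in
    def host_passes(requests, pools):
        free = [dict(p) for p in pools]          # copy so counts can be decremented
        for spec, wanted in requests:
            for pool in free:
                if all(pool.get(k) == v for k, v in spec.items()):
                    take = min(wanted, pool['count'])
                    pool['count'] -= take
                    wanted -= take
                    if wanted == 0:
                        break
            if wanted > 0:
                return False
        return True

    pools = [{'vendor_id': '8086', 'group_name': 'net', 'count': 4},
             {'vendor_id': '15b3', 'group_name': 'net', 'count': 2}]
    print(host_passes([({'group_name': 'net'}, 5)], pools))   # True: 4 from one pool, 1 from the other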
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi John, Thank you for taking the initiative and summing up the work that needs to be done to provide PCI pass-through network support. The only item I think is missing is the neutron support for PCI pass-through. Currently we have the Mellanox plugin that supports PCI pass-through assuming the Mellanox adapter card's embedded switch technology. But in order to have fully integrated PCI pass-through networking support for the use cases Robert listed in a previous mail, generic neutron PCI pass-through support is required. This can be enhanced with vendor-specific tasks that may differ (Mellanox embedded switch vs Cisco 802.1BR), but there is still a common part of being a PCI-aware mechanism driver. I have already started with a definition for this part: https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit# I also plan to start coding soon. Depending on how it goes, I can also take the nova parts that integrate with neutron APIs from item 3. Regards, Irena -Original Message- From: John Garbutt [mailto:j...@johngarbutt.com] Sent: Friday, January 10, 2014 4:34 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Apologies for this top post, I just want to move this discussion towards action. I am traveling next week so it is unlikely that I can make the meetings. Sorry. Can we please agree on some concrete actions, and who will do the coding? This also means raising new blueprints for each item of work. I am happy to review and eventually approve those blueprints, if you email me directly. Ideas are taken from what we started to agree on, mostly written up here: https://wiki.openstack.org/wiki/Meetings/Passthrough#Definitions What doesn't need doing... We have PCI whitelist and PCI alias at the moment; let's keep those names the same for now. I personally prefer PCI-flavor rather than PCI-alias, but let's discuss any rename separately. We seemed happy with the current system (roughly) around GPU passthrough: nova flavor-key three_GPU_attached_30GB set pci_passthrough:alias= large_GPU:1,small_GPU:2 nova boot --image some_image --flavor three_GPU_attached_30GB some_name Again, we seemed happy with the current PCI whitelist. Sure, we could optimise the scheduling, but again, please keep that a separate discussion. Something in the scheduler needs to know how many of each PCI alias are available on each host. How that information gets there can be changed at a later date. PCI alias is in config, but it's probably better defined using host aggregates, or some custom API. But let's leave that for now, and discuss it separately. If the need arises, we can migrate away from the config. What does need doing... == 1) API and CLI changes for nic-type, and associated tempest tests * Add a user-visible nic-type so users can express one of several network types. * We need a default nic-type, for when the user doesn't specify one (might default to SRIOV in some cases) * We can easily test the case where the default is virtual and the user expresses a preference for virtual * Above is much better than not testing it at all.
nova boot --flavor m1.large --image image_id --nic net-id=net-id-1 --nic net-id=net-id-2,nic-type=fast --nic net-id=net-id-3,nic-type=fast vm-name

or

neutron port-create --fixed-ip subnet_id=subnet-id,ip_address=192.168.57.101 --nic-type=slow | fast | foobar net-id
nova boot --flavor m1.large --image image_id --nic port-id=port-id

Where nic-type is just an extra bit of metadata (a string) that is passed to nova and the VIF driver.

2) Expand PCI alias information

We need extensions to the PCI alias so we can group SRIOV devices better. I still think we are yet to agree on a format, but I would suggest this as a starting point:

{ name:GPU_fast,
  devices:[
    {vendor_id:1137, product_id:0071, address:*, attach-type:direct},
    {vendor_id:1137, product_id:0072, address:*, attach-type:direct}
  ],
  sriov_info: {}
}

{ name:NIC_fast,
  devices:[
    {vendor_id:1137, product_id:0071, address:0:[1-50]:2:*, attach-type:macvtap}
    {vendor_id:1234, product_id:0081, address:*, attach-type:direct}
  ],
  sriov_info: { nic_type:fast, network_ids: [net-id-1, net-id-2] }
}

{ name:NIC_slower,
  devices:[
    {vendor_id:1137, product_id:0071, address:*, attach-type:direct}
    {vendor_id:1234, product_id:0081, address:*, attach-type:direct}
  ],
  sriov_info: {
    nic_type:fast,
    network_ids: [*]   # this means could attach to any network
  }
}

The idea being the VIF driver gets passed this info when network_info includes a nic that matches. Any other details, like the VLAN id, would come from neutron and be passed to the VIF driver as normal.

3) Reading nic_type and doing the PCI passthrough of the NIC the user requests

Not sure we are agreed on this, but basically:
* network_info contains nic-type from neutron
* need to select the correct VIF
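As a rough illustration of how an entry in that suggested alias format might be consumed by a VIF driver (field names as proposed above, everything else invented):

    import fnmatch

    # rough illustration only - field names as in the format above, everything else invented
    NIC_FAST = {
        "name": "NIC_fast",
        "devices": [
            {"vendor_id": "1137", "product_id": "0071", "address": "0:[1-50]:2:*", "attach-type": "macvtap"},
            {"vendor_id": "1234", "product_id": "0081", "address": "*", "attach-type": "direct"},
        ],
        "sriov_info": {"nic_type": "fast", "network_ids": ["net-id-1", "net-id-2"]},
    }

    def alias_covers(alias, device):
        for spec in alias["devices"]:
            # note: plain globbing here; the '[1-50]' range syntax would need a richer
            # address parser than fnmatch provides
            if (spec["vendor_id"] == device["vendor_id"]
                    and spec["product_id"] == device["product_id"]
                    and fnmatch.fnmatch(device["address"], spec["address"])):
                return spec["attach-type"]
        return None

    dev = {"vendor_id": "1234", "product_id": "0081", "address": "0000:03:00.2"}
    print(alias_covers(NIC_FAST, dev))   # -> direct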
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 10 January 2014 07:40, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, sorry that I'm not a fan of *your group* term. To me, *your group* mixes two things. It's an extra property provided by configuration, and also it's a very-not-flexible mechanism to select devices (you can only select devices based on the 'group name' property). It is exactly that. It's 0 new config items, 0 new APIs, just an extra tag on the whitelists that are already there (although the proposal suggests changing the name of them to be more descriptive of what they now do). And you talk about flexibility as if this changes frequently, but in fact the grouping / aliasing of devices almost never changes after installation, which is, not coincidentally, when the config on the compute nodes gets set up. 1) A dynamic group is much better. For example, a user may want to select a GPU device based on vendor id, or based on vendor_id+device_id. In other words, the user wants to create groups based on vendor_id, or vendor_id+device_id, and select devices from those groups. John's proposal is very good, to provide an API to create the PCI flavor (or alias). I prefer flavor because it's more openstack style. I disagree with this. I agree that what you're saying offers more flexibility after initial installation, but I have various issues with it. This is directly related to the hardware configuration on each compute node. For (some) other things of this nature, like provider networks, the compute node is the only thing that knows what it has attached to it, and it is the store (in configuration) of that information. If I add a new compute node then it's my responsibility to configure it correctly on attachment, but when I add a compute node (when I'm setting the cluster up, or sometime later on) then it's at that precise point that I know how I've attached it and what hardware it's got on it. Also, it's at this point in time that I write out the configuration file (not by hand, note; there's almost certainly automation when configuring hundreds of nodes, so arguments that 'if I'm writing hundreds of config files one will be wrong' are moot). I'm also not sure there's much reason to change the available devices dynamically after that, since that's normally an activity that results from changing the physical setup of the machine, which implies that actually you're going to have access to and be able to change the config as you do it. John did come up with one case where you might be trying to remove old GPUs from circulation, but it's a very uncommon case that doesn't seem worth coding for, and it's still achievable by changing the config and restarting the compute processes. This also reduces the autonomy of the compute node in favour of centralised tracking, which goes against the 'distributed where possible' philosophy of Openstack. Finally, you're not actually removing configuration from the compute node. You still have to configure a whitelist there; in the grouping design you also have to configure grouping (flavouring) on the control node as well. The groups proposal adds one extra piece of information to the whitelists that are already there to mark groups, not a whole new set of config lines. To compare scheduling behaviour: If I need 4G of RAM, each compute node has reported its summary of free RAM to the scheduler. I look for a compute node with 4G free, and filter the list of compute nodes down. This is a query on n records, n being the number of compute nodes.
I schedule to the compute node, which then confirms it does still have 4G free and runs the VM or rejects the request. If I need 3 PCI devices and use the current system, each machine has reported its device allocations to the scheduler. With SRIOV multiplying up the number of available devices, it's reporting back hundreds of records per compute node to the schedulers, and the filtering activity is 3 queries on n * (number of PCI devices in the cloud) records, which could easily end up in the tens or even hundreds of thousands of records for a moderately sized cloud. The compute node also has a record of its device allocations, which is also checked and updated before the final request is run. If I need 3 PCI devices and use the groups system, each machine has reported its device *summary* to the scheduler. With SRIOV multiplying up the number of available devices, it's still reporting one or a small number of categories, i.e. { net: 100 }. The difficulty of scheduling is a query on (num groups * n) records - fewer, in fact, if some machines have no passthrough devices. You can see that there's quite a cost to be paid for having those flexible alias APIs. 4) IMHO, the core for nova PCI support is **PCI property**. The property means not only generic PCI device properties like vendor id, device id, device type, and compute-specific properties like the BDF address or the adjacent switch IP address, but also user defined property like
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
In any case, we don't have to decide this now. If we simply allowed the whitelist to add extra arbitrary properties to the PCI record (like a group name) and return it to the central server, we could use that in scheduling for the minute as a group name, we wouldn't implement the APIs for flavors yet, and we could get a working system that would be minimally changed from what we already have. We could worry about the scheduling in the scheduling group, and we could leave the APIs (which, as I say, are a minimally useful feature) untl later. then we'd have something useful in short order. -- Ian. On 10 January 2014 13:08, Ian Wells ijw.ubu...@cack.org.uk wrote: On 10 January 2014 07:40, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, sorry that I’m not fan of * your group * term. To me, *your group” mixed two thing. It’s an extra property provided by configuration, and also it’s a very-not-flexible mechanism to select devices (you can only select devices based on the ‘group name’ property). It is exactly that. It's 0 new config items, 0 new APIs, just an extra tag on the whitelists that are already there (although the proposal suggests changing the name of them to be more descriptive of what they now do). And you talk about flexibility as if this changes frequently, but in fact the grouping / aliasing of devices almost never changes after installation, which is, not coincidentally, when the config on the compute nodes gets set up. 1) A dynamic group is much better. For example, user may want to select GPU device based on vendor id, or based on vendor_id+device_id. In another word, user want to create group based on vendor_id, or vendor_id+device_id and select devices from these group. John’s proposal is very good, to provide an API to create the PCI flavor(or alias). I prefer flavor because it’s more openstack style. I disagree with this. I agree that what you're saying offers a more flexibilibility after initial installation but I have various issues with it. This is directly related to the hardware configuation on each compute node. For (some) other things of this nature, like provider networks, the compute node is the only thing that knows what it has attached to it, and it is the store (in configuration) of that information. If I add a new compute node then it's my responsibility to configure it correctly on attachment, but when I add a compute node (when I'm setting the cluster up, or sometime later on) then it's at that precise point that I know how I've attached it and what hardware it's got on it. Also, it's at this that point in time that I write out the configuration file (not by hand, note; there's almost certainly automation when configuring hundreds of nodes so arguments that 'if I'm writing hundreds of config files one will be wrong' are moot). I'm also not sure there's much reason to change the available devices dynamically after that, since that's normally an activity that results from changing the physical setup of the machine which implies that actually you're going to have access to and be able to change the config as you do it. John did come up with one case where you might be trying to remove old GPUs from circulation, but it's a very uncommon case that doesn't seem worth coding for, and it's still achievable by changing the config and restarting the compute processes. This also reduces the autonomy of the compute node in favour of centralised tracking, which goes against the 'distributed where possible' philosophy of Openstack. 
Finally, you're not actually removing configuration from the compute node. You still have to configure a whitelist there; in the grouping design you also have to configure grouping (flavouring) on the control node as well. The groups proposal adds one extra piece of information to the whitelists that are already there to mark groups, not a whole new set of config lines. To compare scheduling behaviour: If I need 4G of RAM, each compute node has reported its summary of free RAM to the scheduler. I look for a compute node with 4G free, and filter the list of compute nodes down. This is a query on n records, n being the number of compute nodes. I schedule to the compute node, which then confirms it does still have 4G free and runs the VM or rejects the request. If I need 3 PCI devices and use the current system, each machine has reported its device allocations to the scheduler. With SRIOV multiplying up the number of available devices, it's reporting back hundreds of records per compute node to the schedulers, and the filtering activity is a 3 queries on n * number of PCI devices in cloud records, which could easily end up in the tens or even hundreds of thousands of records for a moderately sized cloud. There compute node also has a record of its device allocations which is also checked and updated before the final request is run. If I need 3 PCI
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Apologies for this top post, I just want to move this discussion towards action. I am traveling next week so it is unlikely that I can make the meetings. Sorry. Can we please agree on some concrete actions, and who will do the coding? This also means raising new blueprints for each item of work. I am happy to review and eventually approve those blueprints, if you email me directly. Ideas are taken from what we started to agree on, mostly written up here: https://wiki.openstack.org/wiki/Meetings/Passthrough#Definitions What doesn't need doing... We have PCI whitelist and PCI alias at the moment, let keep those names the same for now. I personally prefer PCI-flavor, rather than PCI-alias, but lets discuss any rename separately. We seemed happy with the current system (roughly) around GPU passthrough: nova flavor-key three_GPU_attached_30GB set pci_passthrough:alias= large_GPU:1,small_GPU:2 nova boot --image some_image --flavor three_GPU_attached_30GB some_name Again, we seemed happy with the current PCI whitelist. Sure, we could optimise the scheduling, but again, please keep that a separate discussion. Something in the scheduler needs to know how many of each PCI alias are available on each host. How that information gets there can be change at a later date. PCI alias is in config, but its probably better defined using host aggregates, or some custom API. But lets leave that for now, and discuss it separately. If the need arrises, we can migrate away from the config. What does need doing... == 1) API CLI changes for nic-type, and associated tempest tests * Add a user visible nic-type so users can express on of several network types. * We need a default nic-type, for when the user doesn't specify one (might default to SRIOV in some cases) * We can easily test the case where the default is virtual and the user expresses a preference for virtual * Above is much better than not testing it at all. nova boot --flavor m1.large --image image_id --nic net-id=net-id-1 --nic net-id=net-id-2,nic-type=fast --nic net-id=net-id-3,nic-type=fast vm-name or neutron port-create --fixed-ip subnet_id=subnet-id,ip_address=192.168.57.101 --nic-type=slow | fast | foobar net-id nova boot --flavor m1.large --image image_id --nic port-id=port-id Where nic-type is just an extra bit metadata string that is passed to nova and the VIF driver. 2) Expand PCI alias information We need extensions to PCI alias so we can group SRIOV devices better. I still think we are yet to agree on a format, but I would suggest this as a starting point: { name:GPU_fast, devices:[ {vendor_id:1137,product_id:0071, address:*, attach-type:direct}, {vendor_id:1137,product_id:0072, address:*, attach-type:direct} ], sriov_info: {} } { name:NIC_fast, devices:[ {vendor_id:1137,product_id:0071, address:0:[1-50]:2:*, attach-type:macvtap} {vendor_id:1234,product_id:0081, address:*, attach-type:direct} ], sriov_info: { nic_type:fast, network_ids: [net-id-1, net-id-2] } } { name:NIC_slower, devices:[ {vendor_id:1137,product_id:0071, address:*, attach-type:direct} {vendor_id:1234,product_id:0081, address:*, attach-type:direct} ], sriov_info: { nic_type:fast, network_ids: [*] # this means could attach to any network } } The idea being the VIF driver gets passed this info, when network_info includes a nic that matches. Any other details, like VLAN id, would come from neutron, and passed to the VIF driver as normal. 
3) Reading nic_type and doing the PCI passthrough of NIC user requests Not sure we are agreed on this, but basically: * network_info contains nic-type from neutron * need to select the correct VIF driver * need to pass matching PCI alias information to VIF driver * neutron passes details other details (like VLAN id) as before * nova gives VIF driver an API that allows it to attach PCI devices that are in the whitelist to the VM being configured * with all this, the VIF driver can do what it needs to do * lets keep it simple, and expand it as the need arrises 4) Make changes to VIF drivers, so the above is implemented Depends on (3) These seems like some good steps to get the basics in place for PCI passthrough networking. Once its working, we can review it and see if there are things that need to evolve further. Does that seem like a workable approach? Who is willing to implement any of (1), (2) and (3)? Cheers, John On 9 January 2014 17:47, Ian Wells ijw.ubu...@cack.org.uk wrote: I think I'm in agreement with all of this. Nice summary, Robert. It may not be where the work ends, but if we could get this done the rest is just refinement. On 9 January 2014 17:49, Robert Li (baoli) ba...@cisco.com wrote: Hi Folks, With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
+1 PCI Flavor. From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: January-10-14 1:56 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support BTW, I like the PCI flavor :) From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Thursday, January 09, 2014 10:41 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi, Ian, when you in aggrement with all of this, do you agree with the 'group name', or agree with John's pci flavor? I'm against the PCI group and will send out a reply later. --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Thursday, January 09, 2014 9:47 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support I think I'm in agreement with all of this. Nice summary, Robert. It may not be where the work ends, but if we could get this done the rest is just refinement. On 9 January 2014 17:49, Robert Li (baoli) ba...@cisco.commailto:ba...@cisco.com wrote: Hi Folks, With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on our today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example:pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implications for it is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name in above as a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of benefits can be harvested: * the implementation is significantly simplified * provisioning is simplified by eliminating the PCI alias * a compute node only needs to report stats with something like: PCI group name:count. A compute node processes all the PCI passthrough devices against the whitelist, and assign a PCI group based on the whitelist definition. 
* on the controller, we may only need to define the PCI group names. if we use a nova api to define PCI groups (could be private or public, for example), one potential benefit, among other things (validation, etc), they can be owned by the tenant that creates them. And thus a wholesale of PCI passthrough devices is also possible. * scheduler only works with PCI group names. * request for PCI passthrough device is based on PCI-group * deployers can provision the cloud based on the PCI groups * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivities. Further, to support SRIOV, we are saying that PCI group names not only can be used in the extra specs, it can also be used in the -nic option and the neutron commands. This allows the most flexibilities and functionalities afforded by SRIOV. Further, we are saying that we can define default PCI groups based on the PCI device's class. For vnic-type (or nic-type), we are saying that it defines the link characteristics of the nic that is attached to a VM: a nic that's connected to a virtual switch, a nic that is connected to a physical switch, or a nic that is connected to a physical switch, but has a host macvtap device in between. The actual
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
design, I presume. The bottom line is that we want those requirements to be met. 4) IMHO, the core for nova PCI support is *PCI property*. The property means not only generic PCI devices like vendor id, device id, device type, compute specific property like BDF address, the adjacent switch IP address, but also user defined property like nuertron’s physical net name etc. And then, it’s about how to get these property, how to select/group devices based on the property, how to store/fetch these properties. I agree. But that's exactly what we are trying to accomplish. Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Thursday, January 09, 2014 8:49 AM To: OpenStack Development Mailing List (not for usage questions); Irena Berezovsky; Sandhya Dasu (sadasu); Jiang, Yunhong; Itzik Brown; j...@johngarbutt.commailto:j...@johngarbutt.com; He, Yongli Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Folks, With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on our today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example:pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implications for it is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name in above as a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of benefits can be harvested: * the implementation is significantly simplified * provisioning is simplified by eliminating the PCI alias * a compute node only needs to report stats with something like: PCI group name:count. A compute node processes all the PCI passthrough devices against the whitelist, and assign a PCI group based on the whitelist definition. * on the controller, we may only need to define the PCI group names. 
if we use a nova api to define PCI groups (could be private or public, for example), one potential benefit, among other things (validation, etc), they can be owned by the tenant that creates them. And thus a wholesale of PCI passthrough devices is also possible. * scheduler only works with PCI group names. * request for PCI passthrough device is based on PCI-group * deployers can provision the cloud based on the PCI groups * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivities. Further, to support SRIOV, we are saying that PCI group names not only can be used in the extra specs, it can also be used in the —nic option and the neutron commands. This allows the most flexibilities and functionalities afforded by SRIOV. Further, we are saying that we can define default PCI groups based on the PCI device's class. For vnic-type (or nic-type), we are saying that it defines the link characteristics of the nic that is attached to a VM: a nic that's connected to a virtual switch, a nic that is connected to a physical switch, or a nic that is connected to a physical switch, but has a host macvtap device in between. The actual names of the choices are not important here, and can be debated. I'm hoping that we can go over the above on Monday. But any comments are welcome by email. Thanks, Robert
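A sketch of the reporting step described above, purely for illustration (the whitelist entries and device properties below are made up): the compute node runs each discovered device through the whitelist, which now carries a group name, and reports only {group name: free count} to the scheduler.

    from collections import Counter

    whitelist = [
        {"vendor_id": "8086", "product_id": "10fb", "name": "sriov-net-A"},
        {"vendor_id": "10de", "product_id": "0091", "name": "gpu"},
    ]

    def group_of(device):
        for entry in whitelist:
            if all(device.get(k) == v for k, v in entry.items() if k != "name"):
                return entry["name"]
        return None      # not whitelisted, so not assignable

    devices = [
        {"vendor_id": "8086", "product_id": "10fb", "address": "0000:02:10.0"},
        {"vendor_id": "8086", "product_id": "10fb", "address": "0000:02:10.1"},
        {"vendor_id": "10de", "product_id": "0091", "address": "0000:05:00.0"},
    ]
    stats = Counter(g for g in map(group_of, devices) if g)
    print(dict(stats))   # -> {'sriov-net-A': 2, 'gpu': 1}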
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Yongli, Please also see my response to Yunhong. Here, I just want to add a comment about your local versus global argument. I took a brief look at your patches, and the PCI-flavor is added into the whitelist. The compute node needs to know these pci-flavors in order to report PCI stats based on them. Please correct me if I'm wrong. Another comment is that a compute node doesn't need to consult with the controller, but its report or registration of resources may be rejected by the controller due to non-existing PCI groups. thanks, Robert On 1/10/14 2:11 AM, yongli he yongli...@intel.com wrote: On 2014-01-10 00:49, Robert Li (baoli) wrote: Hi Folks, [yongli] Hi all, basically I favor the pci-flavor style and am against messing with the white-list. Please see my inline comments. With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on our today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example:pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implications for it is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name above a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of benefits can be harvested: [yongli] the white list configuration is mostly local to a host, so only the address belongs in there, as in John's proposal. Mixing the group into the whitelist means we make a global thing per-host style, and this is maybe wrong. * the implementation is significantly simplified [yongli] but more messy - refer to my new patches already sent out. * provisioning is simplified by eliminating the PCI alias [yongli] the pci alias provides a good way to define a globally referenceable name for PCI devices; we need this, and this is also true for John's pci-flavor. * a compute node only needs to report stats with something like: PCI group name:count.
A compute node processes all the PCI passthrough devices against the whitelist, and assign a PCI group based on the whitelist definition. simplify this seems like good, but it does not, separated the local and global is the instinct nature simplify. * on the controller, we may only need to define the PCI group names. if we use a nova api to define PCI groups (could be private or public, for example), one potential benefit, among other things (validation, etc), they can be owned by the tenant that creates them. And thus a wholesale of PCI passthrough devices is also possible. this mean you should consult the controller to deploy your host, if we keep white-list local, we simplify the deploy. * scheduler only works with PCI group names. * request for PCI passthrough device is based on PCI-group * deployers can provision the cloud based on the PCI groups * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivities. Further, to support SRIOV, we are saying that PCI group names not only can be used in the extra specs, it can also be used in the —nic option and the neutron commands. This allows the most flexibilities and functionalities afforded by SRIOV. i still feel use alias/pci flavor is better solution. Further, we are saying that we can define default PCI groups based on the PCI device's class. default
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, thanks for your reply. Please check my response prefix with 'yjiang5'. --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, January 10, 2014 4:08 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support On 10 January 2014 07:40, Jiang, Yunhong yunhong.ji...@intel.commailto:yunhong.ji...@intel.com wrote: Robert, sorry that I'm not fan of * your group * term. To me, *your group mixed two thing. It's an extra property provided by configuration, and also it's a very-not-flexible mechanism to select devices (you can only select devices based on the 'group name' property). It is exactly that. It's 0 new config items, 0 new APIs, just an extra tag on the whitelists that are already there (although the proposal suggests changing the name of them to be more descriptive of what they now do). And you talk about flexibility as if this changes frequently, but in fact the grouping / aliasing of devices almost never changes after installation, which is, not coincidentally, when the config on the compute nodes gets set up. 1) A dynamic group is much better. For example, user may want to select GPU device based on vendor id, or based on vendor_id+device_id. In another word, user want to create group based on vendor_id, or vendor_id+device_id and select devices from these group. John's proposal is very good, to provide an API to create the PCI flavor(or alias). I prefer flavor because it's more openstack style. I disagree with this. I agree that what you're saying offers a more flexibilibility after initial installation but I have various issues with it. [yjiang5] I think you talking is mostly about white list, instead of PCI flavor. PCI flavor is more about PCI request, like I want to have a device with vendor_id = cisco, device_id= 15454E, or 'vendor_id=intel device_class=nic' , ( because the image have the driver for all Intel NIC card :) ). While whitelist is to decide the device that is assignable in a host. This is directly related to the hardware configuation on each compute node. For (some) other things of this nature, like provider networks, the compute node is the only thing that knows what it has attached to it, and it is the store (in configuration) of that information. If I add a new compute node then it's my responsibility to configure it correctly on attachment, but when I add a compute node (when I'm setting the cluster up, or sometime later on) then it's at that precise point that I know how I've attached it and what hardware it's got on it. Also, it's at this that point in time that I write out the configuration file (not by hand, note; there's almost certainly automation when configuring hundreds of nodes so arguments that 'if I'm writing hundreds of config files one will be wrong' are moot). I'm also not sure there's much reason to change the available devices dynamically after that, since that's normally an activity that results from changing the physical setup of the machine which implies that actually you're going to have access to and be able to change the config as you do it. John did come up with one case where you might be trying to remove old GPUs from circulation, but it's a very uncommon case that doesn't seem worth coding for, and it's still achievable by changing the config and restarting the compute processes. [yjiag5] I totally agree with you that whitelist is static defined when provision. 
I just want to separate the information of 'provider network' to another configuration (like extra information). Whitelist is just white list to decide the device assignable. The provider network is information of the device, it's not in the scope of the white list. This also reduces the autonomy of the compute node in favour of centralised tracking, which goes against the 'distributed where possible' philosophy of Openstack. Finally, you're not actually removing configuration from the compute node. You still have to configure a whitelist there; in the grouping design you also have to configure grouping (flavouring) on the control node as well. The groups proposal adds one extra piece of information to the whitelists that are already there to mark groups, not a whole new set of config lines. [yjiang5] Still, while list is to decide the device assignable, not to provide device information. We should mixed functionality to the configuration. If it's ok, I simply want to discard the 'group' term :) The nova PCI flow is simple, compute node provide PCI device (based on white list), the scheduler track the PCI device information (abstracted as pci_stats for performance issue), the API provide method that user specify the device they wanted (the PCI flavor). Current implementation need enhancement on each step of the flow, but I really see no reason to have the Group. Yes, the 'PCI flavor' in fact create group based on PCI
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Brian, the issue of 'class name' is because currently the libvirt does not provide such information, otherwise we are glad to add that :( But this is a good point and we have considered already. One solution is to retrieve it through some code like read the configuration space directly. But that's not so easy especially considering the different platform has different method to get the configuration space. A workaround (at least in first step) is to use the user defined property, so that user can define it through configuration space. The issue to udev is, it's linux specific, and it may even various in different distribution. Thanks --jyh From: Brian Schott [mailto:brian.sch...@nimbisservices.com] Sent: Thursday, January 09, 2014 11:19 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Ian, The idea of pci flavors is a great and using vendor_id and product_id make sense, but I could see a case for adding the class name such as 'VGA compatible controller'. Otherwise, slightly different generations of hardware will mean custom whitelist setups on each compute node. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) On the flip side, vendor_id and product_id might not be sufficient. Suppose I have two identical NICs, one for nova internal use and the second for guest tenants? So, bus numbering may be required. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) 02:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) Some possible combinations: # take 2 gpus pci_passthrough_whitelist=[ { vendor_id:NVIDIA Corporation G71,product_id:GeForce 7900 GTX, name:GPU}, ] # only take the GPU on PCI 2 pci_passthrough_whitelist=[ { vendor_id:NVIDIA Corporation G71,product_id:GeForce 7900 GTX, 'bus_id': '02:', name:GPU}, ] pci_passthrough_whitelist=[ {bus_id: 01:00.0, name: GPU}, {bus_id: 02:00.0, name: GPU}, ] pci_passthrough_whitelist=[ {class: VGA compatible controller, name: GPU}, ] pci_passthrough_whitelist=[ { product_id:GeForce 7900 GTX, name:GPU}, ] I know you guys are thinking of PCI devices, but any though of mapping to something like udev rather than pci? Supporting udev rules might be easier and more robust rather than making something up. Brian - Brian Schott, CTO Nimbis Services, Inc. brian.sch...@nimbisservices.commailto:brian.sch...@nimbisservices.com ph: 443-274-6064 fx: 443-274-6060 On Jan 9, 2014, at 12:47 PM, Ian Wells ijw.ubu...@cack.org.ukmailto:ijw.ubu...@cack.org.uk wrote: I think I'm in agreement with all of this. Nice summary, Robert. It may not be where the work ends, but if we could get this done the rest is just refinement. On 9 January 2014 17:49, Robert Li (baoli) ba...@cisco.commailto:ba...@cisco.com wrote: Hi Folks, With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on our today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. 
pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example:pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implications for it is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name in above as a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias
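Purely to illustrate Brian's class/bus-address suggestion (as noted above, libvirt doesn't currently report the class name, so treat 'class' here as a hypothetical field; nothing below exists in nova today):

    import re

    # sketch only: whitelisting on class name or bus address; 'class' is hypothetical
    # here since libvirt doesn't report it today, as noted above
    whitelist = [
        {"class": "VGA compatible controller", "name": "GPU"},
        {"address": "0000:02:.*", "name": "tenant-nic"},
    ]

    def assignable(device):
        for entry in whitelist:
            if all(re.match(pattern + "$", device.get(key, ""))
                   for key, pattern in entry.items() if key != "name"):
                return entry["name"]
        return None

    gpu = {"class": "VGA compatible controller", "address": "0000:01:00.0"}
    nic = {"class": "Ethernet controller", "address": "0000:02:00.1"}
    print(assignable(gpu), assignable(nic))   # -> GPU tenant-nic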
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 10 January 2014 15:30, John Garbutt j...@johngarbutt.com wrote: We seemed happy with the current system (roughly) around GPU passthrough: nova flavor-key three_GPU_attached_30GB set pci_passthrough:alias= large_GPU:1,small_GPU:2 nova boot --image some_image --flavor three_GPU_attached_30GB some_name Actually, I think we pretty solidly disagree on this point. On the other hand, Yongli's current patch (with pci_flavor in the whitelist) is pretty OK. nova boot --flavor m1.large --image image_id --nic net-id=net-id-1 --nic net-id=net-id-2,nic-type=fast --nic net-id=net-id-3,nic-type=fast vm-name With flavor defined (wherever it's defined): nova boot .. --nic net-id=net-id-1,pci-flavor=xxx# ok, presumably defaults to PCI passthrough --nic net-id=net-id-1,pci-flavor=xxx,vnic-attach=macvtap # ok --nic net-id=net-id-1 # ok - no flavor = vnic --nic port-id=net-id-1,pci-flavor=xxx# ok, gets vnic-attach from port --nic port-id=net-id-1 # ok - no flavor = vnic or neutron port-create --fixed-ip subnet_id=subnet-id,ip_address=192.168.57.101 --nic-type=slow | fast | foobar net-id nova boot --flavor m1.large --image image_id --nic port-id=port-id No, I think not - specifically because flavors are a nova concept and not a neutron one, so putting them on the port is inappropriate. Conversely, vnic-attach is a Neutron concept (fine, nova implements it, but Neutron tells it how) so I think it *is* a port field, and we'd just set it on the newly created port when doing nova boot ..,vnic-attach=thing 2) Expand PCI alias information { name:NIC_fast, sriov_info: { nic_type:fast, network_ids: [net-id-1, net-id-2] Why can't we use the flavor name in --nic (because multiple flavors might be on one NIC type, I guess)? Where does e.g. switch/port information go, particularly since it's per-device (not per-group) and non-scheduling? I think the issue here is that you assume we group by flavor, then add extra info, then group into a NIC group. But for a lot of use cases there is information that differs on every NIC port, so it makes more sense to add extra info to a device, then group into flavor and that can also be used for the --nic. network_ids is interesting, but this is a nova config file and network_ids are (a) from Neutron (b) ephemeral, so we can't put them in config. They could be provider network names, but that's not the same thing as a neutron network name and not easily discoverable, outside of Neutron i.e. before scheduling. Again, Yongli's current change with pci-flavor in the whitelist records leads to a reasonable way to how to make this work here, I think; straightforward extra_info would be fine (though perhaps nice if it's easier to spot it as of a different type from the whitelist regex fields). ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hey Yunhong, The thing about 'group' and 'flavor' and 'whitelist' is that they once meant distinct things (and I think we've been trying to reduce them back from three things to two or one): - group: equivalent devices at a host level - use any one, no-one will care, because they're either identical or as near as makes no difference - flavor: equivalent devices to an end user - we may re-evaluate our offerings and group them differently on the fly - whitelist: either 'something to match the devices you may assign' (originally) or 'something to match the devices you may assign *and* put them in the group (in the group proposal) Bearing in mind what you said about scheduling, and if we skip 'group' for a moment, then can I suggest (or possibly restate, because your comments are pointing in this direction): - we allow extra information to be added at what is now the whitelisting stage, that just gets carried around with the device - when we're turning devices into flavors, we can also match on that extra information if we want (which means we can tag up the devices on the compute node if we like, according to taste, and then bundle them up by tag to make flavors; or we can add Neutron specific information and ignore it when making flavors) - we would need to add a config param on the control host to decide which flags to group on when doing the stats (and they would additionally be the only params that would work for flavors, I think) ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
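A rough sketch of that flow (all names invented for illustration): the extra information is attached alongside the whitelist match and carried with the device, and a flavor can then match on it exactly like vendor/product id.

    device = {"vendor_id": "8086", "product_id": "10fb", "address": "0000:02:00.1"}
    extra_info = {"group": "fast-net", "physical_network": "physnet1"}   # from the compute node's config
    device_record = dict(device, **extra_info)                           # carried around together

    flavor = {"name": "fast-nic", "match": {"vendor_id": "8086", "group": "fast-net"}}

    def flavor_matches(flavor, record):
        return all(record.get(k) == v for k, v in flavor["match"].items())

    print(flavor_matches(flavor, device_record))   # -> True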
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, thanks for your reply. Please check comments prefix with [yjiang5]. Thanks --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, January 10, 2014 12:17 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hey Yunhong, The thing about 'group' and 'flavor' and 'whitelist' is that they once meant distinct things (and I think we've been trying to reduce them back from three things to two or one): - group: equivalent devices at a host level - use any one, no-one will care, because they're either identical or as near as makes no difference - flavor: equivalent devices to an end user - we may re-evaluate our offerings and group them differently on the fly - whitelist: either 'something to match the devices you may assign' (originally) or 'something to match the devices you may assign *and* put them in the group (in the group proposal) [yjiang5] Really thanks for the summary and it is quite clear. So what's the object of equivalent devices at host level? Because 'equivalent device * to an end user * is flavor, so is it 'equivalent to *scheduler* or 'equivalent to *xxx*'? If equivalent to scheduler, then I'd take the pci_stats as a flexible group for scheduler, and I'd think 'equivalent for scheduler' as a restriction for 'equivalent to end user' because of performance issue, otherwise, it's needless. Secondly, for your definition of 'whitelist', I'm hesitate to your '*and*' because IMHO, 'and' means mixed two things together, otherwise, we can state in simply one sentence. For example, I prefer to have another configuration option to define the 'put devices in the group', or, if we extend it , be define extra information like 'group name' for devices. Bearing in mind what you said about scheduling, and if we skip 'group' for a moment, then can I suggest (or possibly restate, because your comments are pointing in this direction): - we allow extra information to be added at what is now the whitelisting stage, that just gets carried around with the device [yjiang5] For 'added at ... whitelisting stage', see my above statement about the configuration. However, if you do want to use whitelist, I'm ok, but please keep in mind that it's two functionality combined: device you may assign *and* the group name for these devices. - when we're turning devices into flavors, we can also match on that extra information if we want (which means we can tag up the devices on the compute node if we like, according to taste, and then bundle them up by tag to make flavors; or we can add Neutron specific information and ignore it when making flavors) [yjiang5] Agree. Currently we can only use vendor_id and device_id for flavor/alias, but we can extend it to cover such extra information since now it's a API. - we would need to add a config param on the control host to decide which flags to group on when doing the stats (and they would additionally be the only params that would work for flavors, I think) [yjiang5] Agree. And this is achievable because we switch the flavor to be API, then we can control the flavor creation process. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
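For illustration, the config knob being discussed might behave like this (the option name and values are invented): only the listed keys fold devices into pci_stats pools, so they are also the only keys a flavor can usefully match on.

    from collections import Counter

    pci_stats_keys = ["vendor_id", "product_id", "physical_network"]

    devices = [
        {"vendor_id": "8086", "product_id": "10fb", "physical_network": "physnet1", "address": "0000:02:00.0"},
        {"vendor_id": "8086", "product_id": "10fb", "physical_network": "physnet1", "address": "0000:02:00.1"},
        {"vendor_id": "15b3", "product_id": "1004", "physical_network": "physnet2", "address": "0000:05:00.0"},
    ]

    pools = Counter(tuple(d.get(k) for k in pci_stats_keys) for d in devices)
    for key, count in pools.items():
        print(dict(zip(pci_stats_keys, key)), "->", count)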
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
I have to use [yjiang5_1] prefix now :) --jyh From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, January 10, 2014 3:55 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support On 11 January 2014 00:04, Jiang, Yunhong yunhong.ji...@intel.commailto:yunhong.ji...@intel.com wrote: [yjiang5] Really thanks for the summary and it is quite clear. So what's the object of equivalent devices at host level? Because 'equivalent device * to an end user * is flavor, so is it 'equivalent to *scheduler* or 'equivalent to *xxx*'? If equivalent to scheduler, then I'd take the pci_stats as a flexible group for scheduler To the scheduler, indeed. And with the group proposal the scheduler and end user equivalences are one and the same. [yjiang5_1] Once use the proposal, then we missed the flexible for 'end user equivalences and that's the reason I'm against the group :) Secondly, for your definition of 'whitelist', I'm hesitate to your '*and*' because IMHO, 'and' means mixed two things together, otherwise, we can state in simply one sentence. For example, I prefer to have another configuration option to define the 'put devices in the group', or, if we extend it , be define extra information like 'group name' for devices. I'm not stating what we should do, or what the definitions should mean; I'm saying how they've been interpreted as weve discussed this in the past. We've had issues in the past where we've had continuing difficulties in describing anything without coming back to a 'whitelist' (generally meaning 'matching expression, as an actual 'whitelist' is implied, rather than separately required, in a grouping system. Bearing in mind what you said about scheduling, and if we skip 'group' for a moment, then can I suggest (or possibly restate, because your comments are pointing in this direction): - we allow extra information to be added at what is now the whitelisting stage, that just gets carried around with the device [yjiang5] For 'added at ... whitelisting stage', see my above statement about the configuration. However, if you do want to use whitelist, I'm ok, but please keep in mind that it's two functionality combined: device you may assign *and* the group name for these devices. Indeed - which is in fact what we've been proposing all along. - when we're turning devices into flavors, we can also match on that extra information if we want (which means we can tag up the devices on the compute node if we like, according to taste, and then bundle them up by tag to make flavors; or we can add Neutron specific information and ignore it when making flavors) [yjiang5] Agree. Currently we can only use vendor_id and device_id for flavor/alias, but we can extend it to cover such extra information since now it's a API. - we would need to add a config param on the control host to decide which flags to group on when doing the stats (and they would additionally be the only params that would work for flavors, I think) [yjiang5] Agree. And this is achievable because we switch the flavor to be API, then we can control the flavor creation process. OK - so if this is good then I think the question is how we could change the 'pci_whitelist' parameter we have - which, as you say, should either *only* do whitelisting or be renamed - to allow us to add information. 
Yongli has something along those lines but it's not flexible and it distinguishes poorly between which bits are extra information and which bits are matching expressions (and it's still called pci_whitelist) - but even with those criticisms it's very close to what we're talking about. When we have that I think a lot of the rest of the arguments should simply resolve themselves. [yjiang5_1] The reason it's not easy to find a flexible/distinguishable change to pci_whitelist is that it combines two things. So a stupid/naive solution in my head is: change it to a VERY generic name, 'pci_devices_information', and change the schema to an array of {'device_property' = regex exp, 'group_name' = 'g1'} dictionaries, where the device_property expression can be 'address == xxx, vendor_id == xxx' (i.e. similar to the current white list), and we can squeeze more into pci_devices_information in future, like 'network_information' = xxx or the Neutron-specific information you required in a previous mail. All keys other than 'device_property' become extra information, i.e. software-defined properties. This extra information will be carried with the PCI devices. Some implementation details: A) we can limit the acceptable keys, e.g. only support 'group_name' and 'network_id', or we can accept any key other than the reserved ones (vendor_id, device_id etc). B) if a device matches 'device_property' in several entries, raise an exception, or use the first one. [yjiang5_1] Another thing that needs discussing is, as you pointed out, we would need to add a config
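A concrete sketch of that schema, purely as an illustration of the proposal above (the option name and the extra keys are as proposed in this mail, not an existing nova option):

# would replace pci_whitelist: 'device_property' is the matching expression, all other keys are carried as extra, software-defined properties
pci_devices_information = [ {"device_property": "vendor_id == 8086, address == 0000:0a:00.*", "group_name": "g1"}, {"device_property": "vendor_id == 15b3", "group_name": "g2", "network_id": "physnet1"} ]

Under option B) above, a device whose address and vendor_id happened to match both entries would either raise an exception or simply land in 'g1', whichever rule is chosen.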
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Folks, With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example: pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implication is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name above a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of benefits can be harvested: * the implementation is significantly simplified * provisioning is simplified by eliminating the PCI alias * a compute node only needs to report stats with something like: PCI group name:count. A compute node processes all the PCI passthrough devices against the whitelist, and assigns a PCI group based on the whitelist definition. * on the controller, we may only need to define the PCI group names. if we use a nova api to define PCI groups (could be private or public, for example), one potential benefit, among other things (validation, etc), they can be owned by the tenant that creates them. And thus wholesaling of PCI passthrough devices is also possible. * scheduler only works with PCI group names. * request for PCI passthrough device is based on PCI-group * deployers can provision the cloud based on the PCI groups * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivities. Further, to support SRIOV, we are saying that PCI group names can be used not only in the extra specs but also in the --nic option and the neutron commands. This allows the full flexibility and functionality afforded by SRIOV. Further, we are saying that we can define default PCI groups based on the PCI device's class.
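For illustration only, the PCI group proposal above would collapse the configuration into something like the following; the pci_passthrough:pci_group extra spec and the pci-group attribute on --nic are sketches of the idea, not an agreed interface (today only pci_passthrough:alias exists):

# compute node: the whitelist entry doubles as the group definition
pci_passthrough_whitelist = [{"vendor_id": "8086", "product_id": "10ed", "name": "phynet1-sriov"}]
# controller: a nova flavor requests devices by group name (hypothetical extra spec)
nova flavor-key m1.sriov set "pci_passthrough:pci_group"="phynet1-sriov:1"
# boot: the same group name could be referenced per NIC (hypothetical --nic attribute)
nova boot --flavor m1.sriov --image <image> --nic net-id=<net-uuid>,pci-group=phynet1-sriov vm1

The compute node would then report its stats simply as phynet1-sriov:<free count>.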
For vnic-type (or nic-type), we are saying that it defines the link characteristics of the nic that is attached to a VM: a nic that's connected to a virtual switch, a nic that is connected to a physical switch, or a nic that is connected to a physical switch, but has a host macvtap device in between. The actual names of the choices are not important here, and can be debated. I'm hoping that we can go over the above on Monday. But any comments are welcome by email. Thanks, Robert ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
I think I'm in agreement with all of this. Nice summary, Robert. It may not be where the work ends, but if we could get this done the rest is just refinement. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Ian, The idea of pci flavors is great, and using vendor_id and product_id makes sense, but I could see a case for adding the class name such as 'VGA compatible controller'. Otherwise, slightly different generations of hardware will mean custom whitelist setups on each compute node. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) On the flip side, vendor_id and product_id might not be sufficient. Suppose I have two identical NICs, one for nova internal use and the second for guest tenants? So, bus numbering may be required. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) 02:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) Some possible combinations: # take 2 gpus pci_passthrough_whitelist=[ { vendor_id:NVIDIA Corporation G71,product_id:GeForce 7900 GTX, name:GPU}, ] # only take the GPU on PCI 2 pci_passthrough_whitelist=[ { vendor_id:NVIDIA Corporation G71,product_id:GeForce 7900 GTX, 'bus_id': '02:', name:GPU}, ] pci_passthrough_whitelist=[ {bus_id: 01:00.0, name: GPU}, {bus_id: 02:00.0, name: GPU}, ] pci_passthrough_whitelist=[ {class: VGA compatible controller, name: GPU}, ] pci_passthrough_whitelist=[ { product_id:GeForce 7900 GTX, name:GPU}, ] I know you guys are thinking of PCI devices, but any thought of mapping to something like udev rather than pci? Supporting udev rules might be easier and more robust rather than making something up. Brian - Brian Schott, CTO Nimbis Services, Inc. brian.sch...@nimbisservices.com ph: 443-274-6064 fx: 443-274-6060
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, One use case was brought up in today's meeting that I think is not valid. It is the use case where all 3 vnic types: virtio, direct and macvtap (the terms used in the meeting were slow, fast, faster/foobar) could be attached to the same VM. The main difference between a direct and a macvtap interface is that the former does not support live migration. So, attaching both direct and macvtap pci-passthrough interfaces to the same VM would mean that it cannot support live migration. In that case assigning the macvtap interface is in essence a waste. So, it would be ideal to disallow such an assignment or at least warn the user that the VM will now not be able to support live migration. We can however still combine direct or macvtap pci-passthrough interfaces with virtio vnic types without issue. Thanks, Sandhya
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 9 January 2014 20:19, Brian Schott brian.sch...@nimbisservices.comwrote: Ian, The idea of pci flavors is a great and using vendor_id and product_id make sense, but I could see a case for adding the class name such as 'VGA compatible controller'. Otherwise, slightly different generations of hardware will mean custom whitelist setups on each compute node. Personally, I think the important thing is to have a matching expression. The more flexible the matching language, the better. On the flip side, vendor_id and product_id might not be sufficient. Suppose I have two identical NICs, one for nova internal use and the second for guest tenants? So, bus numbering may be required. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) 02:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) I totally concur on this - with network devices in particular the PCI path is important because you don't accidentally want to grab the Openstack control network device ;) I know you guys are thinking of PCI devices, but any though of mapping to something like udev rather than pci? Supporting udev rules might be easier and more robust rather than making something up. Past experience has told me that udev rules are not actually terribly good, which you soon discover when you have to write expressions like: SUBSYSTEM==net, KERNELS==:83:00.0, ACTION==add, NAME=eth8 which took me a long time to figure out and is self-documenting only in that it has a recognisable PCI path in there, 'KERNELS' not being a meaningful name to me. And self-documenting is key to udev rules, because there's not much information on the tag meanings otherwise. I'm comfortable with having a match format that covers what we know and copes with extension for when we find we're short a feature, and what we have now is close to that. Yes, it needs the class adding, we all agree, and you should be able to match on PCI path, which you can't now, but it's close. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 9 January 2014 22:50, Ian Wells ijw.ubu...@cack.org.uk wrote: On 9 January 2014 20:19, Brian Schott brian.sch...@nimbisservices.comwrote: On the flip side, vendor_id and product_id might not be sufficient. Suppose I have two identical NICs, one for nova internal use and the second for guest tenants? So, bus numbering may be required. 01:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) 02:00.0 VGA compatible controller: NVIDIA Corporation G71 [GeForce 7900 GTX] (rev a1) I totally concur on this - with network devices in particular the PCI path is important because you don't accidentally want to grab the Openstack control network device ;) Redundant statement is redundant. Sorry, yes, this has been a pet bugbear of mine. It applies equally to provider networks on the networking side of thing, and, where Neutron is not your network device manager for a PCI device, you may want several device groups bridged to different segments. Network devices are one case of a category of device where there's something about the device that you can't detect that means it's not necessarily interchangeable with its peers. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Robert, sorry that I'm not a fan of your 'group' term. To me, your 'group' mixes two things: it's an extra property provided by configuration, and it's also a very inflexible mechanism for selecting devices (you can only select devices based on the 'group name' property). 1) A dynamic group is much better. For example, a user may want to select a GPU device based on vendor_id, or based on vendor_id+device_id. In other words, users want to create groups based on vendor_id, or on vendor_id+device_id, and select devices from those groups. John's proposal is very good: provide an API to create the PCI flavor (or alias). I prefer flavor because it's more OpenStack style. 2) As for the second thing in your 'group', I'd understand it as an extra property provided by configuration. I don't think we should put it into the white list, which is there to configure which devices are assignable. I'd add another configuration option to provide extra attributes for devices. When nova compute comes up, it will parse this configuration and add the attributes to the corresponding PCI devices. I don't think adding another configuration option will cause too much trouble for deployment. OpenStack already has a lot of configuration items :) 3) I think we are currently mixing the neutron and nova design. To me, Neutron SRIOV support is a user of nova PCI support. Thus we should first analyse the requirements that neutron PCI support places on nova PCI support in a more generic way, and then we can discuss how we enhance the nova PCI support, or, if you want, re-design the nova PCI support. IMHO, if we don't consider networking, the current implementation should be OK. 4) IMHO, the core of nova PCI support is the *PCI property*. The properties include not only generic PCI device properties like vendor id, device id and device type, and compute-specific properties like the BDF address or the adjacent switch IP address, but also user-defined properties like neutron's physical net name etc. And then it's about how to get these properties, how to select/group devices based on them, and how to store/fetch them. Thanks --jyh
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, Ian, when you say you are in agreement with all of this, do you agree with the 'group name', or with John's pci flavor? I'm against the PCI group and will send out a reply later. --jyh
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
BTW, I like the PCI flavor :)
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 2014/01/10 00:49, Robert Li (baoli) wrote: Hi Folks, Hi all, basically I favor the pci-flavor style and am against messing with the white-list. Please see my inline comments. With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday. But before that, and based on today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminologies that we are using so far. With the current nova PCI passthrough PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ vendor_id:,product_id:}] PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud. Currently it has the following format: pci_alias={vendor_id:, product_id:, name:str} nova flavor extra_specs: request for PCI passthrough devices can be specified with extra_specs in the format for example: pci_passthrough:alias=name:count As you can see, currently a PCI alias has a name and is defined on the controller. The implication is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name. What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like: pci_passthrough_whitelist=[{ vendor_id:,product_id:, name:str}] By doing so, we eliminated the PCI alias. And we call the name above a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of the white-list configuration is mostly local to a host, so only the address belongs in there; John's proposal is good on this point. Mixing the group into the whitelist means we define a global thing in a per-host style, which is probably wrong. benefits can be harvested: * the implementation is significantly simplified but it actually becomes messier; refer to my new patches already sent out. * provisioning is simplified by eliminating the PCI alias the PCI alias provides a good way to define a globally referenceable name for PCI devices; we need this, and the same is true of John's pci-flavor. * a compute node only needs to report stats with something like: PCI group name:count. A compute node processes all the PCI passthrough devices against the whitelist, and assigns a PCI group based on the whitelist definition. this looks like a simplification, but it is not; keeping the local and the global separate is the natural simplification. * on the controller, we may only need to define the PCI group names. if we use a nova api to define PCI groups (could be private or public, for example), one potential benefit, among other things (validation, etc), they can be owned by the tenant that creates them. And thus wholesaling of PCI passthrough devices is also possible. this means you have to consult the controller in order to deploy your host; if we keep the white-list local, we simplify deployment. * scheduler only works with PCI group names.
* request for PCI passthrough device is based on PCI-group * deployers can provision the cloud based on the PCI groups * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivities. Further, to support SRIOV, we are saying that PCI group names can be used not only in the extra specs but also in the --nic option and the neutron commands. This allows the full flexibility and functionality afforded by SRIOV. I still feel the alias/pci-flavor is the better solution. Further, we are saying that we can define default PCI groups based on the PCI device's class. default grouping makes our conceptual model messier; pre-defining a global thing in the API and in hard code is not a good way to go, and I posted a -2 for this. For vnic-type (or nic-type), we are saying that it defines the link characteristics of the nic that is attached to a VM: a nic that's connected to a virtual switch, a nic that is connected to a physical switch, or a nic that is connected to a physical switch, but has a host macvtap device in between. The actual names of the choices are not important here, and can be debated. I'm hoping that we can go over the above on Monday. But any comments are welcome by email.
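To illustrate the split Yongli is arguing for, here is a sketch under stated assumptions (the whitelist content and the pci-flavor-create API name are hypothetical): the host keeps a minimal, address-only whitelist, while the globally referenceable name is created through an API on the controller.

# compute node (local): only says which devices are assignable
pci_passthrough_whitelist = [{"address": "0000:06:00.*"}]
# controller (global): hypothetical API creating the referenceable pci-flavor
nova pci-flavor-create fastnic --property vendor_id=8086 --property product_id=10ed

With this split, re-cabling or redeploying a host only touches that host's configuration file, while renaming or re-scoping the flavor is a controller-side API operation.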
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 22 December 2013 12:07, Irena Berezovsky ire...@mellanox.com wrote: Hi Ian, My comments are inline I would like to suggest to focus the next PCI-pass though IRC meeting on: 1.Closing the administration and tenant that powers the VM use cases. 2. Decouple the nova and neutron parts to start focusing on the neutron related details. When is the next meeting? I have lost track due to holidays, etc. John ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi John, We had one on 12/24/2013 with the log: http://eavesdrop.openstack.org/meetings/pci_passthrough_meeting/2013/pci_passthrough_meeting.2013-12-24-14.02.log.html The next one will be at UTC 1400 on Jan. 7th, Tuesday. --Robert On 1/2/14 10:06 AM, John Garbutt j...@johngarbutt.com wrote: On 22 December 2013 12:07, Irena Berezovsky ire...@mellanox.com wrote: Hi Ian, My comments are inline I would like to suggest to focus the next PCI-pass through IRC meeting on: 1. Closing the administration and tenant that powers the VM use cases. 2. Decouple the nova and neutron parts to start focusing on the neutron related details. When is the next meeting? I have lost track due to holidays, etc. John ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, I would just like to share my idea on somehow managing sr-iov networking attributes in neutron (e.g. mac addr, ip addr, vlan). I've had experience implementing this and that was before pci-passthrough feature in nova existed. Basically, nova still did the plugging and the unplugging of vifs and neutron did all the provisioning of networking attributes. At that time, the best hack I can do was to treat sr-iov nics as ordinary vifs that were distinguishable by nova and neutron. So to implement that, when booting an instance in nova, a certain sr-iov-vf-specific extra_spec was used (e.g. vfs := 1) indicating the number of sr-iov vfs to create and eventually represented as mere vif objects in nova. In nova, the sr-iov vfs were represented as vifs but a special exception was made wherein sr-iov vfs aren't really plugged, because of course it isn't necessary. In effect, the vifs that represent the vfs were accounted in the db including its ip and mac addresses, and vlan tags. With respect to l2 isolation, the vlan tags were retrieved when booting the instance through neutron api and were applied in libvirt xml. To summarize, the networking attributes such as ip and mac addresses and vlan tags were applied normally to vfs and thus preserved the normal OS way of managing these like ordinary vifs. However, since its just a hack, some consequences and issues surfaced such as, proper migration of these networking attributes weren't tested, libvirt seems to mistakenly swap the mac addresses when rebooting the instances, and most importantly the vifs that represented the vfs lack passthrough-specific information. Since today OS already has this concept of PCI-passthrough, I'm thinking this could be combined with the idea of a vf that is represented by a vif to have a complete abstraction of a manageable sr-iov vf. I have not read thoroughly the preceeding replies, so this idea might be redundant or irrelevant already. Cheers, Pepe On Thu, Oct 17, 2013 at 4:32 AM, Irena Berezovsky ire...@mellanox.comwrote: Hi, As one of the next steps for PCI pass-through I would like to discuss is the support for PCI pass-through vNIC. While nova takes care of PCI pass-through device resources management and VIF settings, neutron should manage their networking configuration. I would like to register a summit proposal to discuss the support for PCI pass-through networking. I am not sure what would be the right topic to discuss the PCI pass-through networking, since it involve both nova and neutron. There is already a session registered by Yongli on nova topic to discuss the PCI pass-through next steps. I think PCI pass-through networking is quite a big topic and it worth to have a separate discussion. Is there any other people who are interested to discuss it and share their thoughts and experience? Regards, Irena ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev -- To stop learning is like to stop loving. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 12/17/2013 10:09 AM, Ian Wells wrote: Reiterating from the IRC mneeting, largely, so apologies. Firstly, I disagree that https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support is an accurate reflection of the current state. It's a very unilateral view, largely because the rest of us had been focussing on the google document that we've been using for weeks. Secondly, I totally disagree with this approach. This assumes that description of the (cloud-internal, hardware) details of each compute node is best done with data stored centrally and driven by an API. I don't agree with either of these points. Firstly, the best place to describe what's available on a compute node is in the configuration on the compute node. For instance, I describe which interfaces do what in Neutron on the compute node. This is because when you're provisioning nodes, that's the moment you know how you've attached it to the network and what hardware you've put in it and what you intend the hardware to be for - or conversely your deployment puppet or chef or whatever knows it, and Razor or MAAS has enumerated it, but the activities are equivalent. Storing it centrally distances the compute node from its descriptive information for no good purpose that I can see and adds the complexity of having to go make remote requests just to start up. Secondly, even if you did store this centrally, it's not clear to me that an API is very useful. As far as I can see, the need for an API is really the need to manage PCI device flavors. If you want that to be API-managed, then the rest of a (rather complex) API cascades from that one choice. Most of the things that API lets you change (expressions describing PCI devices) are the sort of thing that you set once and only revisit when you start - for instance - deploying new hosts in a different way. I at the parallel in Neutron provider networks. They're config driven, largely on the compute hosts. Agents know what ports on their machine (the hardware tie) are associated with provider networks, by provider network name. The controller takes 'neutron net-create ... --provider:network 'name'' and uses that to tie a virtual network to the provider network definition on each host. What we absolutely don't do is have a complex admin API that lets us say 'in host aggregate 4, provider network x (which I made earlier) is connected to eth6'. FWIW, I could not agree more. The Neutron API already suffers from overcomplexity. There's really no need to make it even more complex than it already is, especially for a feature that more naturally fits in configuration data (Puppet/Chef/etc) and isn't something that you would really ever change for a compute host once set. Best, -jay ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Ian, My comments are inline I would like to suggest to focus the next PCI-pass though IRC meeting on: 1.Closing the administration and tenant that powers the VM use cases. 2. Decouple the nova and neutron parts to start focusing on the neutron related details. BR, Irena From: Ian Wells [mailto:ijw.ubu...@cack.org.uk] Sent: Friday, December 20, 2013 2:50 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support On 19 December 2013 15:15, John Garbutt j...@johngarbutt.commailto:j...@johngarbutt.com wrote: Note, I don't see the person who boots the server ever seeing the pci-flavor, only understanding the server flavor. [IrenaB] I am not sure that elaborating PCI device request into server flavor is the right approach for the PCI pass-through network case. vNIC by its nature is something dynamic that can be plugged or unplugged after VM boot. server flavor is quite static. I was really just meaning the server flavor specify the type of NIC to attach. The existing port specs, etc, define how many nics, and you can hot plug as normal, just the VIF plugger code is told by the server flavor if it is able to PCI passthrough, and which devices it can pick from. The idea being combined with the neturon network-id you know what to plug. The more I talk about this approach the more I hate it :( The thinking we had here is that nova would provide a VIF or a physical NIC for each attachment. Precisely what goes on here is a bit up for grabs, but I would think: Nova specifiies the type at port-update, making it obvious to Neutron it's getting a virtual interface or a passthrough NIC (and the type of that NIC, probably, and likely also the path so that Neutron can distinguish between NICs if it needs to know the specific attachment port) Neutron does its magic on the network if it has any to do, like faffing(*) with switches Neutron selects the VIF/NIC plugging type that Nova should use, and in the case that the NIC is a VF and it wants to set an encap, returns that encap back to Nova Nova plugs it in and sets it up (in libvirt, this is generally in the XML; XenAPI and others are up for grabs). [IrenaB] I agree on the described flow. Still need to close how to elaborate the request for pass-through vNIC into the 'nova boot'. We might also want a nic-flavor that tells neutron information it requires, but lets get to that later... [IrenaB] nic flavor is definitely something that we need in order to choose if high performance (PCI pass-through) or virtio (i.e. OVS) nic will be created. Well, I think its the right way go. Rather than overloading the server flavor with hints about which PCI devices you could use. The issue here is that additional attach. Since for passthrough that isn't NICs (like crypto cards) you would almost certainly specify it in the flavor, if you did the same for NICs then you would have a preallocated pool of NICs from which to draw. The flavor is also all you need to know for billing, and the flavor lets you schedule. If you have it on the list of NICs, you have to work out how many physical NICs you need before you schedule (admittedly not hard, but not in keeping) and if you then did a subsequent attach it could fail because you have no more NICs on the machine you scheduled to - and at this point you're kind of stuck. 
Also with the former, if you've run out of NICs, the already-extant resize call would allow you to pick a flavor with more NICs and you can then reschedule the subsequent VM to wherever resources are available to fulfil the new request. [IrenaB] Still think that putting PCI NIC request into Server Flavor is not right approach. You will need to create different server flavors per any possible combination of tenant networks attachment options, or maybe assume he is connecting to all. As for billing, you can use type of vNIC in addition to packets in/out for billing per vNIC. This way, tenant will be charged only for used vNICs. One question here is whether Neutron should become a provider of billed resources (specifically passthrough NICs) in the same way as Cinder is of volumes - something we'd not discussed to date; we've largely worked on the assumption that NICs are like any other passthrough resource, just one where, once it's allocated out, Neutron can work magic with it. [IrenaB] I am not so familiar with Ceilometer, but seems that if we are talking about network resources, neutron should be in charge. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
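As a rough sketch of the port-update exchange described in this sub-thread: the field names below (binding:vnic_type, binding:profile, binding:vif_type, binding:vif_details) and their values are illustrative assumptions only, not an interface anyone has committed to here.

# nova -> neutron at port-update: the host, the kind of attachment nova can offer, and the chosen device path
{"port": {"binding:host_id": "compute-1", "binding:vnic_type": "direct", "binding:profile": {"pci_slot": "0000:06:00.2"}}}
# neutron -> nova in the response: how to plug it, plus any encap (e.g. the VF's VLAN) nova must apply
{"port": {"binding:vif_type": "hw_veb", "binding:vif_details": {"vlan": 42}}}

Nova would then do the hypervisor-specific plugging (in libvirt, generally in the domain XML) using exactly what came back, which keeps all the switch-side work on the Neutron side.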
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Apologies for being late onto this thread, and not making the meeting the other day. Also apologies this is almost totally a top post. On 17 December 2013 15:09, Ian Wells ijw.ubu...@cack.org.uk wrote: Firstly, I disagree that https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support is an accurate reflection of the current state. It's a very unilateral view, largely because the rest of us had been focussing on the google document that we've been using for weeks. I haven't seen the google doc. I got involved through the blueprint review of this: https://blueprints.launchpad.net/nova/+spec/pci-extra-info I assume its this one? https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs On a quick read, my main concern is separating out the user more: * administration (defines pci-flavor, defines which hosts can provide it, defines server flavor...) * person who boots server (picks server flavor, defines neutron ports) Note, I don't see the person who boots the server ever seeing the pci-flavor, only understanding the server flavor. We might also want a nic-flavor that tells neutron information it requires, but lets get to that later... Secondly, I totally disagree with this approach. This assumes that description of the (cloud-internal, hardware) details of each compute node is best done with data stored centrally and driven by an API. I don't agree with either of these points. Possibly, but I would like to first agree on the use cases and data model we want. Nova has generally gone for APIs over config in recent times. Mostly so you can do run-time configuration of the system. But lets just see what makes sense when we have the use cases agreed. On 2013年12月16日 22:27, Robert Li (baoli) wrote: I'd like to give you guy a summary of current state, let's discuss it then. https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support 1) fade out alias ( i think this ok for all) 2) white list became pic-flavor ( i think this ok for all) 3) address simply regular expression support: only * and a number range is support [hex-hex]. ( i think this ok?) 4) aggregate : now it's clear enough, and won't impact SRIOV. ( i think this irrelevant to SRIOV now) So... this means we have: PCI-flavor: * i.e. standardGPU, standardGPUnew, fastGPU, hdFlash1TB etc Host mapping: * decide which hosts you allow a particular flavor to be used * note, the scheduler still needs to find out if any devices are free flavor (of the server): * usual RAM, CPU, Storage * use extra specs to add PCI devices * example: ** add one PCI device, choice of standardGPU or standardGPUnew ** also add: one hdFlash1TB Now, the other bit is SRIOV... At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same network id spanning both. If we go for that latter new either need: * some kind of nic-flavor ** boot ... 
-nic nic-id:public-id:,nic-flavor:10GBpassthrough ** but neutron could store nic-flavor, and pass it through to VIF driver, and user says port-id * OR add NIC config into the server flavor ** extra spec to say, tell VIF driver it could use one of this list of PCI devices: (list pci-flavors) * OR do both I vote for nic-flavor only, because it matches the volume-type we have with cinder. However, it does suggest that Nova should leave all the SRIOV work to the VIF driver. So the VIF driver, as activated by neutron, will understand which PCI devices to pass through. Similar to the plan with brick, we could have an oslo lib that helps you attach SRIOV devices that could be used by the neutron VIF drivers and the nova PCI passthrough code. Thanks, John ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
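As a sketch of the two encodings being weighed above, the structures below show a PCI request carried in server-flavor extra specs versus a nic-flavor carried on each requested NIC. The key names (pci_passthrough:devices, nic-flavor) are placeholders for the proposal, not an agreed Nova or Neutron API.

    # Option A: devices requested via server-flavor extra specs
    # (one GPU from either of two pci-flavors, plus one flash card).
    server_flavor_extra_specs = {
        "pci_passthrough:devices": "standardGPU|standardGPUnew:1, hdFlash1TB:1",
    }

    # Option B: a nic-flavor attached to each --nic spec, so the VIF driver
    # (as driven by Neutron) decides whether a passthrough VF is used.
    requested_nics = [
        {"net-id": "net-vlan42", "nic-flavor": "10GBpassthrough"},
        {"net-id": "net-mgmt",   "nic-flavor": "virtio"},
    ]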
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
John: At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same network id spanning both. If we go for that latter new either need: * some kind of nic-flavor ** boot ... -nic nic-id:public-id:,nic-flavor:10GBpassthrough ** but neutron could store nic-flavor, and pass it through to VIF driver, and user says port-id * OR add NIC config into the server flavor ** extra spec to say, tell VIF driver it could use on of this list of PCI devices: (list pci-flavors) * OR do both I vote for nic-flavor only, because it matches the volume-type we have with cinder. I think the issue there is that Nova is managing the supply of PCI devices (which is limited and limited on a per-machine basis). Indisputably you need to select the NIC you want to use as a passthrough rather than a vnic device, so there's something in the --nic argument, but you have to answer two questions: - how many devices do you need (which is now not a flavor property but in the --nic list, which seems to me an odd place to be defining billable resources) - what happens when someone does nova interface-attach? Cinder's an indirect parallel because the resources it's adding to the hypervisor are virtual and unlimited, I think, or am I missing something here? However, it does suggest that Nova should leave all the SRIOV work to the VIF driver. So the VIF driver, as activate by neutron, will understand which PCI devices to passthrough. Similar to the plan with brick, we could have an oslo lib that helps you attach SRIOV devices that could be used by the neturon VIF drivers and the nova PCI passthrough code. I'm not clear that this is necessary. At the moment with vNICs, you pass through devices by having a co-operation between Neutron (which configures a way of attaching them to put them on a certain network) and the hypervisor specific code (which creates them in the instance and attaches them as instructed by Neutron). Why would we not follow the same pattern with passthrough devices? In this instance, neutron would tell nova that when it's plugging this device it should be a passthrough device, and pass any additional parameters like the VF encap, and Nova would do as instructed, then Neutron would reconfigure whatever parts of the network need to be reconfigured in concert with the hypervisor's settings to make the NIC a part of the specified network. -- Ian. Thanks, John ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
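Ian's first question, how many devices a boot request implies, can be illustrated with a small helper that derives the PCI request count from the --nic list before scheduling. The helper and the nic-flavor names are hypothetical.

    PASSTHROUGH_FLAVORS = {"10GBpassthrough"}   # assumed set of passthrough nic-flavors

    def passthrough_device_count(requested_nics):
        # Count the NICs that will each consume one finite per-host VF.
        return sum(1 for nic in requested_nics
                   if nic.get("nic-flavor") in PASSTHROUGH_FLAVORS)

    passthrough_device_count(
        [{"net-id": "net-a", "nic-flavor": "10GBpassthrough"},
         {"net-id": "net-b", "nic-flavor": "virtio"}])   # returns 1

The second question remains open in either model: a later nova interface-attach of another passthrough NIC can still fail if the chosen host has no free devices left.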
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 19 December 2013 12:21, Ian Wells ijw.ubu...@cack.org.uk wrote: John: At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same network id spanning both. If we go for that latter new either need: * some kind of nic-flavor ** boot ... -nic nic-id:public-id:,nic-flavor:10GBpassthrough ** but neutron could store nic-flavor, and pass it through to VIF driver, and user says port-id * OR add NIC config into the server flavor ** extra spec to say, tell VIF driver it could use on of this list of PCI devices: (list pci-flavors) * OR do both I vote for nic-flavor only, because it matches the volume-type we have with cinder. I think the issue there is that Nova is managing the supply of PCI devices (which is limited and limited on a per-machine basis). Indisputably you need to select the NIC you want to use as a passthrough rather than a vnic device, so there's something in the --nic argument, but you have to answer two questions: - how many devices do you need (which is now not a flavor property but in the --nic list, which seems to me an odd place to be defining billable resources) - what happens when someone does nova interface-attach? Agreed. The --nic list specifies how many NICs. I was suggesting adding a nic-flavor on each --nic spec to say if its PCI passthrough vs virtual NIC. Cinder's an indirect parallel because the resources it's adding to the hypervisor are virtual and unlimited, I think, or am I missing something here? I was more referring more to the different volume-types i.e. fast volume or normal volume. And how that is similar to virtual vs fast PCI passthough vs slow PCI passthrough Local volumes probably have the same issues as PCI passthrough with finite resources. But I am not sure we have a good solution for that yet. Mostly, it seems right that Cinder and Neutron own the configuration about the volume and network resources. The VIF driver and volume drivers seem to have a similar sort of relationship with Cinder and Neutron vs Nova. Then the issues boils down to visibility into that data so we can schedule efficiently, which is no easy problem. However, it does suggest that Nova should leave all the SRIOV work to the VIF driver. So the VIF driver, as activate by neutron, will understand which PCI devices to passthrough. Similar to the plan with brick, we could have an oslo lib that helps you attach SRIOV devices that could be used by the neturon VIF drivers and the nova PCI passthrough code. I'm not clear that this is necessary. At the moment with vNICs, you pass through devices by having a co-operation between Neutron (which configures a way of attaching them to put them on a certain network) and the hypervisor specific code (which creates them in the instance and attaches them as instructed by Neutron). Why would we not follow the same pattern with passthrough devices? 
In this instance, neutron would tell nova that when it's plugging this device it should be a passthrough device, and pass any additional parameters like the VF encap, and Nova would do as instructed, then Neutron would reconfigure whatever parts of the network need to be reconfigured in concert with the hypervisor's settings to make the NIC a part of the specified network. I agree, in general terms. Firstly, do you agree the neutron network-id can be used for passthrough and non-passthrough VIF connections? i.e. a neutron network-id does not imply PCI-passthrough. Secondly, we need to agree on the information flow around defining the flavor of the NIC, i.e. virtual or passthroughFast or passthroughNormal. My gut feeling is that the neutron port description should somehow define this via a nic-flavor that maps to a group of pci-flavors. But from a billing point of view, I like the idea of the server flavor saying to the VIF plug code: by the way, for this server, please back all the NICs with devices from pciflavor:fastNic, should that be possible given the user's port configuration. But this is leaking neutron/networking information into Nova, which seems bad. John ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
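The nic-flavor to pci-flavor grouping suggested above could be as simple as a mapping like the one below; the flavor names are only examples, not defined resources.

    nic_flavor_to_pci_flavors = {
        "passthroughFast":   ["fastNic"],
        "passthroughNormal": ["standardNic", "olderNic"],
        "virtual":           [],   # plain vNIC, consumes no PCI device
    }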
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 19 December 2013 12:54, John Garbutt j...@johngarbutt.com wrote: On 19 December 2013 12:21, Ian Wells ijw.ubu...@cack.org.uk wrote: John: At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same network id spanning both. If we go for that latter new either need: * some kind of nic-flavor ** boot ... -nic nic-id:public-id:,nic-flavor:10GBpassthrough ** but neutron could store nic-flavor, and pass it through to VIF driver, and user says port-id * OR add NIC config into the server flavor ** extra spec to say, tell VIF driver it could use on of this list of PCI devices: (list pci-flavors) * OR do both I vote for nic-flavor only, because it matches the volume-type we have with cinder. I think the issue there is that Nova is managing the supply of PCI devices (which is limited and limited on a per-machine basis). Indisputably you need to select the NIC you want to use as a passthrough rather than a vnic device, so there's something in the --nic argument, but you have to answer two questions: - how many devices do you need (which is now not a flavor property but in the --nic list, which seems to me an odd place to be defining billable resources) - what happens when someone does nova interface-attach? Agreed. Apologies, I misread what you put, maybe we don't agree... I am just trying not to make a passthrough NIC and odd special case. In my mind, it should just be a regular neturon port connection that happens to be implemented using PCI passthrough. I agree we need to sort out the scheduling of that, because its a finite resource. The --nic list specifies how many NICs. I was suggesting adding a nic-flavor on each --nic spec to say if its PCI passthrough vs virtual NIC. Cinder's an indirect parallel because the resources it's adding to the hypervisor are virtual and unlimited, I think, or am I missing something here? I was more referring more to the different volume-types i.e. fast volume or normal volume. And how that is similar to virtual vs fast PCI passthough vs slow PCI passthrough Local volumes probably have the same issues as PCI passthrough with finite resources. But I am not sure we have a good solution for that yet. Mostly, it seems right that Cinder and Neutron own the configuration about the volume and network resources. The VIF driver and volume drivers seem to have a similar sort of relationship with Cinder and Neutron vs Nova. Then the issues boils down to visibility into that data so we can schedule efficiently, which is no easy problem. However, it does suggest that Nova should leave all the SRIOV work to the VIF driver. So the VIF driver, as activate by neutron, will understand which PCI devices to passthrough. Similar to the plan with brick, we could have an oslo lib that helps you attach SRIOV devices that could be used by the neturon VIF drivers and the nova PCI passthrough code. I'm not clear that this is necessary. 
At the moment with vNICs, you pass through devices by having a co-operation between Neutron (which configures a way of attaching them to put them on a certain network) and the hypervisor specific code (which creates them in the instance and attaches them as instructed by Neutron). Why would we not follow the same pattern with passthrough devices? In this instance, neutron would tell nova that when it's plugging this device it should be a passthrough device, and pass any additional parameters like the VF encap, and Nova would do as instructed, then Neutron would reconfigure whatever parts of the network need to be reconfigured in concert with the hypervisor's settings to make the NIC a part of the specified network. I agree, in general terms. Firstly, do you agree the neutron network-id can be used for passthrough and non-passthrough VIF connections? i.e. a neturon network-id does not imply PCI-passthrough. Secondly, we need to agree on the information flow around defining the flavor of the NIC. i.e. virtual or passthroughFast or passthroughNormal. My gut feeling is that neutron port description should somehow define this via a nic-flavor that maps to a group of pci-flavors. But from a billing point of view, I like the idea of the server flavor saying to the VIF plug code, by the way, for this server, please support all the nics using devices in pciflavor:fastNic should that be possible for the users given port
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi John, I totally agree that we should define the use cases both for administration and tenant that powers the VM. Since we are trying to support PCI pass-through network, let's focus on the related use cases. Please see my comments inline. Regards, Irena -Original Message- From: John Garbutt [mailto:j...@johngarbutt.com] Sent: Thursday, December 19, 2013 1:42 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Apologies for being late onto this thread, and not making the meeting the other day. Also apologies this is almost totally a top post. On 17 December 2013 15:09, Ian Wells ijw.ubu...@cack.org.uk wrote: Firstly, I disagree that https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support is an accurate reflection of the current state. It's a very unilateral view, largely because the rest of us had been focussing on the google document that we've been using for weeks. I haven't seen the google doc. I got involved through the blueprint review of this: https://blueprints.launchpad.net/nova/+spec/pci-extra-info I assume its this one? https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs On a quick read, my main concern is separating out the user more: * administration (defines pci-flavor, defines which hosts can provide it, defines server flavor...) * person who boots server (picks server flavor, defines neutron ports) Note, I don't see the person who boots the server ever seeing the pci-flavor, only understanding the server flavor. [IrenaB] I am not sure that elaborating PCI device request into server flavor is the right approach for the PCI pass-through network case. vNIC by its nature is something dynamic that can be plugged or unplugged after VM boot. server flavor is quite static. We might also want a nic-flavor that tells neutron information it requires, but lets get to that later... [IrenaB] nic flavor is definitely something that we need in order to choose if high performance (PCI pass-through) or virtio (i.e. OVS) nic will be created. Secondly, I totally disagree with this approach. This assumes that description of the (cloud-internal, hardware) details of each compute node is best done with data stored centrally and driven by an API. I don't agree with either of these points. Possibly, but I would like to first agree on the use cases and data model we want. Nova has generally gone for APIs over config in recent times. Mostly so you can do run-time configuration of the system. But lets just see what makes sense when we have the use cases agreed. On 2013年12月16日 22:27, Robert Li (baoli) wrote: I'd like to give you guy a summary of current state, let's discuss it then. https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support 1) fade out alias ( i think this ok for all) 2) white list became pic-flavor ( i think this ok for all) 3) address simply regular expression support: only * and a number range is support [hex-hex]. ( i think this ok?) 4) aggregate : now it's clear enough, and won't impact SRIOV. ( i think this irrelevant to SRIOV now) So... this means we have: PCI-flavor: * i.e. 
standardGPU, standardGPUnew, fastGPU, hdFlash1TB etc Host mapping: * decide which hosts you allow a particular flavor to be used * note, the scheduler still needs to find out if any devices are free flavor (of the server): * usual RAM, CPU, Storage * use extra specs to add PCI devices * example: ** add one PCI device, choice of standardGPU or standardGPUnew ** also add: one hdFlash1TB Now, the other bit is SRIOV... At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same network id spanning both. [IrenaB] Agree. SRIOV connection is the choice for certain VM on certain network. The same VM can be connected to other network via virtio nic as well as other VMs can be connected to the same network via virtio nics. If we go for that latter new either need: * some kind of nic-flavor ** boot ... -nic nic-id:public-id:,nic-flavor:10GBpassthrough ** but neutron could store nic-flavor, and pass it through to VIF driver, and user says port-id * OR add NIC config into the server flavor ** extra spec to say, tell VIF driver it could use on of this list of PCI devices: (list pci-flavors) * OR do both I vote for nic-flavor only, because it matches the volume-type we have
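The point agreed on above, that SR-IOV and virtio attachments coexist on the same network, reduces to port data like the following; the ids and flavor names are illustrative only.

    # Two ports on the same neutron network (say VLAN 42): one backed by a
    # passthrough VF, one by an OVS virtio tap.  Both draw addresses from the
    # same subnet and share the same network id.
    ports_on_vlan42 = [
        {"network_id": "net-vlan42", "nic-flavor": "10GBpassthrough"},
        {"network_id": "net-vlan42", "nic-flavor": "virtio"},
    ]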
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Response inline... On 19 December 2013 13:05, Irena Berezovsky ire...@mellanox.com wrote: Hi John, I totally agree that we should define the use cases both for administration and tenant that powers the VM. Since we are trying to support PCI pass-through network, let's focus on the related use cases. Please see my comments inline. Cool. Regards, Irena -Original Message- From: John Garbutt [mailto:j...@johngarbutt.com] Sent: Thursday, December 19, 2013 1:42 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Apologies for being late onto this thread, and not making the meeting the other day. Also apologies this is almost totally a top post. On 17 December 2013 15:09, Ian Wells ijw.ubu...@cack.org.uk wrote: Firstly, I disagree that https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support is an accurate reflection of the current state. It's a very unilateral view, largely because the rest of us had been focussing on the google document that we've been using for weeks. I haven't seen the google doc. I got involved through the blueprint review of this: https://blueprints.launchpad.net/nova/+spec/pci-extra-info I assume its this one? https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs On a quick read, my main concern is separating out the user more: * administration (defines pci-flavor, defines which hosts can provide it, defines server flavor...) * person who boots server (picks server flavor, defines neutron ports) Note, I don't see the person who boots the server ever seeing the pci-flavor, only understanding the server flavor. [IrenaB] I am not sure that elaborating PCI device request into server flavor is the right approach for the PCI pass-through network case. vNIC by its nature is something dynamic that can be plugged or unplugged after VM boot. server flavor is quite static. I was really just meaning the server flavor specify the type of NIC to attach. The existing port specs, etc, define how many nics, and you can hot plug as normal, just the VIF plugger code is told by the server flavor if it is able to PCI passthrough, and which devices it can pick from. The idea being combined with the neturon network-id you know what to plug. The more I talk about this approach the more I hate it :( We might also want a nic-flavor that tells neutron information it requires, but lets get to that later... [IrenaB] nic flavor is definitely something that we need in order to choose if high performance (PCI pass-through) or virtio (i.e. OVS) nic will be created. Well, I think its the right way go. Rather than overloading the server flavor with hints about which PCI devices you could use. Secondly, I totally disagree with this approach. This assumes that description of the (cloud-internal, hardware) details of each compute node is best done with data stored centrally and driven by an API. I don't agree with either of these points. Possibly, but I would like to first agree on the use cases and data model we want. Nova has generally gone for APIs over config in recent times. Mostly so you can do run-time configuration of the system. But lets just see what makes sense when we have the use cases agreed. On 2013年12月16日 22:27, Robert Li (baoli) wrote: I'd like to give you guy a summary of current state, let's discuss it then. 
https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support 1) fade out alias ( i think this ok for all) 2) white list became pic-flavor ( i think this ok for all) 3) address simply regular expression support: only * and a number range is support [hex-hex]. ( i think this ok?) 4) aggregate : now it's clear enough, and won't impact SRIOV. ( i think this irrelevant to SRIOV now) So... this means we have: PCI-flavor: * i.e. standardGPU, standardGPUnew, fastGPU, hdFlash1TB etc Host mapping: * decide which hosts you allow a particular flavor to be used * note, the scheduler still needs to find out if any devices are free flavor (of the server): * usual RAM, CPU, Storage * use extra specs to add PCI devices * example: ** add one PCI device, choice of standardGPU or standardGPUnew ** also add: one hdFlash1TB Now, the other bit is SRIOV... At a high level: Neutron: * user wants to connect to a particular neutron network * user wants a super-fast SRIOV connection Administration: * needs to map PCI device to what neutron network the connect to The big question is: * is this a specific SRIOV only (provider) network * OR... are other non-SRIOV connections also made to that same network I feel we have to go for that latter. Imagine a network on VLAN 42, you might want some SRIOV into that network, and some OVS connecting into the same network. The user might have VMs connected using both methods, so wants the same IP address ranges and same
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On 19 December 2013 15:15, John Garbutt j...@johngarbutt.com wrote: Note, I don't see the person who boots the server ever seeing the pci-flavor, only understanding the server flavor. [IrenaB] I am not sure that elaborating PCI device request into server flavor is the right approach for the PCI pass-through network case. vNIC by its nature is something dynamic that can be plugged or unplugged after VM boot. server flavor is quite static. I was really just meaning the server flavor specify the type of NIC to attach. The existing port specs, etc, define how many nics, and you can hot plug as normal, just the VIF plugger code is told by the server flavor if it is able to PCI passthrough, and which devices it can pick from. The idea being combined with the neturon network-id you know what to plug. The more I talk about this approach the more I hate it :( The thinking we had here is that nova would provide a VIF or a physical NIC for each attachment. Precisely what goes on here is a bit up for grabs, but I would think: Nova specifiies the type at port-update, making it obvious to Neutron it's getting a virtual interface or a passthrough NIC (and the type of that NIC, probably, and likely also the path so that Neutron can distinguish between NICs if it needs to know the specific attachment port) Neutron does its magic on the network if it has any to do, like faffing(*) with switches Neutron selects the VIF/NIC plugging type that Nova should use, and in the case that the NIC is a VF and it wants to set an encap, returns that encap back to Nova Nova plugs it in and sets it up (in libvirt, this is generally in the XML; XenAPI and others are up for grabs). We might also want a nic-flavor that tells neutron information it requires, but lets get to that later... [IrenaB] nic flavor is definitely something that we need in order to choose if high performance (PCI pass-through) or virtio (i.e. OVS) nic will be created. Well, I think its the right way go. Rather than overloading the server flavor with hints about which PCI devices you could use. The issue here is that additional attach. Since for passthrough that isn't NICs (like crypto cards) you would almost certainly specify it in the flavor, if you did the same for NICs then you would have a preallocated pool of NICs from which to draw. The flavor is also all you need to know for billing, and the flavor lets you schedule. If you have it on the list of NICs, you have to work out how many physical NICs you need before you schedule (admittedly not hard, but not in keeping) and if you then did a subsequent attach it could fail because you have no more NICs on the machine you scheduled to - and at this point you're kind of stuck. Also with the former, if you've run out of NICs, the already-extant resize call would allow you to pick a flavor with more NICs and you can then reschedule the subsequent VM to wherever resources are available to fulfil the new request. One question here is whether Neutron should become a provider of billed resources (specifically passthrough NICs) in the same way as Cinder is of volumes - something we'd not discussed to date; we've largely worked on the assumption that NICs are like any other passthrough resource, just one where, once it's allocated out, Neutron can work magic with it. -- Ian. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
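One way to picture the exchange Ian lays out: Nova's port-update carries what it allocated, and Neutron's answer tells Nova how to plug and which encap to program. The field names below are loosely modelled on the existing port-binding extension but are assumptions, not a settled schema.

    # What Nova might send at port-update time (the VF it allocated and its path):
    port_update_from_nova = {
        "binding:host_id": "compute-12",
        "binding:profile": {
            "pci_slot": "0000:09:00.1",          # the VF Nova picked
            "pci_vendor_info": "8086:10ed",
            "vnic_kind": "passthrough",          # virtual vs passthrough NIC
        },
    }

    # What Neutron might answer after reconfiguring its side of the network:
    port_binding_from_neutron = {
        "binding:vif_type": "pci_passthrough",   # placeholder plugging type
        "binding:capabilities": {"vlan": 42},    # the encap Nova should set
    }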
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Port profile is generic way of Neutron to pass plugin-specific data as dictionary. Cisco plugin uses it to pass VMEFX specific data. Robert, correct me if I'm wrong. thanks, --- Isaku Yamahata isaku.yamah...@gmail.com On Thu, Oct 31, 2013 at 10:21:20PM +, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, I think your change request for pci alias should be covered by the extra infor enhancement. https://blueprints.launchpad.net/nova/+spec/pci-extra-info and Yongli is working on it. I'm not sure how the port profile is passed to the connected switch, is it a Cisco VMEFX specific method or libvirt method? Sorry I'm not well on network side. --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Wednesday, October 30, 2013 10:13 AM To: Irena Berezovsky; Jiang, Yunhong; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi, Regarding physical network mapping, This is what I thought. consider the following scenarios: 1. a compute node with SRIOV only interfaces attached to a physical network. the node is connected to one upstream switch 2. a compute node with both SRIOV interfaces and non-SRIOV interfaces attached to a physical network. the node is connected to one upstream switch 3. in addition to case 1 2, a compute node may have multiple vNICs that are connected to different upstream switches. CASE 1: -- the mapping from a virtual network (in terms of neutron) to a physical network is actually done by binding a port profile to a neutron port. With cisco's VM-FEX, a port profile is associated with one or multiple vlans. Once the neutron port is bound with this port-profile in the upstream switch, it's effectively plugged into the physical network. -- since the compute node is connected to one upstream switch, the existing nova PCI alias will be sufficient. For example, one can boot a Nova instance that is attached to a SRIOV port with the following command: nova boot -flavor m1.large -image image-id --nic net-id=net,pci-alias=alias,sriov=direct|macvtap,port-profile=profile the net-id will be useful in terms of allocating IP address, enable dhcp, etc that is associated with the network. -- the pci-alias specified in the nova boot command is used to create a PCI request for scheduling purpose. a PCI device is bound to a neutron port during the instance build time in the case of nova boot. Before invoking the neutron API to create a port, an allocated PCI device out of a PCI alias will be located from the PCI device list object. This device info among other information will be sent to neutron to create the port. CASE 2: -- Assume that OVS is used for the non-SRIOV interfaces. An example of configuration with ovs plugin would look like: bridge_mappings = physnet1:br-vmfex network_vlan_ranges = physnet1:15:17 tenant_network_type = vlan When a neutron network is created, a vlan is either allocated or specified in the neutron net-create command. Attaching a physical interface to the bridge (in the above example br-vmfex) is an administrative task. -- to create a Nova instance with non-SRIOV port: nova boot -flavor m1.large -image image-id --nic net-id=net -- to create a Nova instance with SRIOV port: nova boot -flavor m1.large -image image-id --nic net-id=net,pci-alias=alias,sriov=direct|macvtap,port-profile=profile it's essentially the same as in the first case. 
But since the net-id is already associated with a vlan, the vlan associated with the port-profile must be identical to that vlan. This has to be enforced by neutron. again, since the node is connected to one upstream switch, the existing nova PCI alias should be sufficient. CASE 3: -- A compute node might be connected to multiple upstream switches, with each being a separate network. This means SRIOV PFs/VFs are already implicitly associated with physical networks. In the none-SRIOV case, a physical interface is associated with a physical network by plugging it into that network, and attaching this interface to the ovs bridge that represents this physical network on the compute node. In the SRIOV case, we need a way to group the SRIOV VFs that belong to the same physical networks. The existing nova PCI alias is to facilitate PCI device allocation by associating product_id, vendor_id with an alias name. This will no longer be sufficient. But it can be enhanced to achieve our goal. For example, the PCI device domain, bus (if their mapping to vNIC is fixed across boot) may be added into the alias, and the alias name should be corresponding to a list of tuples. Another consideration
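CASE 3 above implies the alias would have to grow from plain (vendor_id, product_id) matching into something that can group VFs by the physical network their PF is wired to. A hypothetical enhanced alias, following the domain/bus idea in the text, might look like this; all values are invented.

    pci_aliases = [
        {"name": "sriov_physnet1",
         "devices": [{"vendor_id": "8086", "product_id": "10ed",
                      "domain": "0000", "bus": "09"}]},   # VFs behind the PF on bus 09
        {"name": "sriov_physnet2",
         "devices": [{"vendor_id": "8086", "product_id": "10ed",
                      "domain": "0000", "bus": "0a"}]},   # VFs behind the PF on bus 0a
    ]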
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On Wed, Oct 30, 2013 at 04:14:40AM +, Jiang, Yunhong yunhong.ji...@intel.com wrote: But how about long term direction? Neutron should know/manage such network related resources on compute nodes? So you mean the PCI device management will be spited between Nova and Neutron? For example, non-NIC device owned by nova and NIC device owned by neutron? Yes. But I'd like to hear from other Neutron developers. There have been so many discussion of the scheduler enhancement, like https://etherpad.openstack.org/p/grizzly-split-out-scheduling , so possibly that's the right direction? Let's wait for the summit discussion. Interesting. Yeah, I look forward for the summit discussion. Let's try to involve not only Nova developers, but also other Neutron developers. thanks, -- Isaku Yamahata isaku.yamah...@gmail.com ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
. Regards, Irena From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Friday, October 25, 2013 11:16 PM To: prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; Irena Berezovsky; yunhong.ji...@intel.commailto:yunhong.ji...@intel.com; chris.frie...@windriver.commailto:chris.frie...@windriver.com; yongli...@intel.commailto:yongli...@intel.com Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, This is Robert Li from Cisco Systems. Recently, I was tasked to investigate such support for Cisco's systems that support VM-FEX, which is a SRIOV technology supporting 802-1Qbh. I was able to bring up nova instances with SRIOV interfaces, and establish networking in between the instances that employes the SRIOV interfaces. Certainly, this was accomplished with hacking and some manual intervention. Based on this experience and my study with the two existing nova pci-passthrough blueprints that have been implemented and committed into Havana (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base and https://blueprints.launchpad.net/nova/+spec/pci-passthrough-libvirt), I registered a couple of blueprints (one on Nova side, the other on the Neutron side): https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov https://blueprints.launchpad.net/neutron/+spec/pci-passthrough-sriov in order to address SRIOV support in openstack. Please take a look at them and see if they make sense, and let me know any comments and questions. We can also discuss this in the summit, I suppose. I noticed that there is another thread on this topic, so copy those folks from that thread as well. thanks, Robert On 10/16/13 4:32 PM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi, As one of the next steps for PCI pass-through I would like to discuss is the support for PCI pass-through vNIC. While nova takes care of PCI pass-through device resources management and VIF settings, neutron should manage their networking configuration. I would like to register asummit proposal to discuss the support for PCI pass-through networking. I am not sure what would be the right topic to discuss the PCI pass-through networking, since it involve both nova and neutron. There is already a session registered by Yongli on nova topic to discuss the PCI pass-through next steps. I think PCI pass-through networking is quite a big topic and it worth to have a separate discussion. Is there any other people who are interested to discuss it and share their thoughts and experience? Regards, Irena ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Robert, is it possible to have a IRC meeting? I'd prefer to IRC meeting because it's more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with -nic option specify a network id, and then how nova device the requirement to the PCI device? I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto the networking setup that neutron has created on the host. Normally, one will invoke nova boot api with the -nic options to specify the nic with which the instance will be connected to the network. It currently allows net-id, fixed ip and/or port-id to be specified for the option. However, it doesn't allow one to specify special networking requirements for the instance. Thanks to the nova pci-passthrough work, one can specify PCI passthrough device(s) in the nova flavor. But it doesn't provide means to tie up these PCI devices in the case of ethernet adpators with networking services. Therefore the idea is actually simple as indicated by the blueprint titles, to provide means to tie up SRIOV devices with neutron services. A work flow would roughly look like this for 'nova boot': -- Specifies networking requirements in the -nic option. Specifically for SRIOV, allow the following to be specified in addition to the existing required information: . PCI alias . direct pci-passthrough/macvtap . port profileid that is compliant with 802.1Qbh The above information is optional. In the absence of them, the existing behavior remains. -- if special networking requirements exist, Nova api creates PCI requests in the nova instance type for scheduling purpose -- Nova scheduler schedules the instance based on the requested flavor plus the PCI requests that are created for networking. -- Nova compute invokes neutron services with PCI passthrough information if any -- Neutron performs its normal operations based on the request, such as allocating a port, assigning ip addresses, etc. Specific to SRIOV, it should validate the information such as profileid, and stores them in its db. It's also possible to associate a port profileid with a neutron network so that port profileid becomes optional in the -nic option. 
Neutron returns nova the port information, especially for PCI passthrough related information in the port binding object. Currently, the port binding object contains the following information: binding:vif_type binding:host_id binding:profile binding:capabilities -- nova constructs the domain xml and plug in the instance by calling the vif driver. The vif driver can build up the interface xml based on the port binding information. The blueprints you registered make sense. On Nova side, there is a need to bind between requested virtual network and PCI device/interface to be allocated as vNIC. On the Neutron side, there is a need to support networking configuration of the vNIC. Neutron should be able to identify the PCI device/macvtap interface in order to apply configuration. I think it makes sense to provide neutron integration via dedicated Modular Layer 2 Mechanism Driver to allow PCI pass-through vNIC support along with other networking technologies. I haven't sorted through this yet. A neutron port could be associated with a PCI device or not, which is a common feature, IMHO. However, a ML2 driver may be needed specific to a particular SRIOV technology. During the Havana Release, we introduced Mellanox Neutron plugin that enables networking via SRIOV pass-through devices or macvtap interfaces. We want to integrate
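The last step described above, the VIF driver turning port-binding information into a libvirt interface definition, could look roughly like the sketch below. The profile keys are assumptions; only the overall shape of the hostdev interface element follows libvirt's documented format.

    def interface_xml(profile):
        # Build the <interface type='hostdev'> element for an 802.1Qbh VF,
        # from a dict carrying the allocated PCI address and port profile.
        return (
            "<interface type='hostdev' managed='yes'>\n"
            "  <source>\n"
            "    <address type='pci' domain='0x%(domain)s' bus='0x%(bus)s'"
            " slot='0x%(slot)s' function='0x%(function)s'/>\n"
            "  </source>\n"
            "  <virtualport type='802.1Qbh'>\n"
            "    <parameters profileid='%(profileid)s'/>\n"
            "  </virtualport>\n"
            "</interface>" % profile)

    print(interface_xml({"domain": "0000", "bus": "09", "slot": "00",
                         "function": "1", "profileid": "my-port-profile"}))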
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: * it can be done on port creation. * It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. * It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have a IRC meeting? I'd prefer to IRC meeting because it's more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with -nic option specify a network id, and then how nova device the requirement to the PCI device? I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.commailto:chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. 
The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto the networking setup that neutron has created on the host. Normally, one will invoke nova boot api with the -nic options to specify the nic with which the instance will be connected to the network. It currently allows net-id, fixed ip and/or port-id to be specified for the option. However, it doesn't allow one to specify special networking requirements for the instance. Thanks to the nova pci-passthrough work, one can specify PCI passthrough device(s) in the nova flavor. But it doesn't provide means to tie up these PCI devices in the case of ethernet adpators with networking services. Therefore the idea is actually simple as indicated by the blueprint titles, to provide means to tie up SRIOV devices with neutron services. A work flow would roughly look like this for 'nova boot': -- Specifies networking requirements in the -nic option. Specifically for SRIOV, allow the following to be specified in addition to the existing required information: . PCI alias . direct pci-passthrough/macvtap . port profileid that is compliant with 802.1Qbh The above information is optional. In the absence of them, the existing behavior remains
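The provider:physical_network mapping Irena describes can be written down as a small per-host table; the interface and alias names are invented for the sketch.

    # Each physical network maps to the PF it is wired to on this host, and
    # to the pool (alias) its VFs are drawn from.
    physnet_map = {
        "physnet1": {"pf_interface": "eth2", "pci_alias": "sriov_physnet1"},
        "physnet2": {"pf_interface": "eth3", "pci_alias": "sriov_physnet2"},
    }
    # A virtual network created with provider:physical_network=physnet1 would
    # therefore have its passthrough ports allocated from eth2's VFs.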
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
I would love to see a symmetry between Cinder local volumes and Neutron PCI passthrough VIFs. Not entirely sure I have that clear in my head right now, but I just wanted to share the idea: * describe resource external to nova that is attached to VM in the API (block device mapping and/or vif references) * ideally the nova scheduler needs to be aware of the local capacity, and how that relates to the above information (relates to the cross service scheduling issues) * state of the device should be stored by Neutron/Cinder (attached/detached, capacity, IP, etc), but still exposed to the scheduler * connection params get given to Nova from Neutron/Cinder * nova still has the vif driver or volume driver to make the final connection * the disk should be formatted/expanded, and network info injected in the same way as before (cloud-init, config drive, DHCP, etc) John On 29 October 2013 10:17, Irena Berezovsky ire...@mellanox.com wrote: Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: · it can be done on port creation. · It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. · It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have a IRC meeting? I’d prefer to IRC meeting because it’s more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with –nic option specify a network id, and then how nova device the requirement to the PCI device? I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? 
Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto the networking setup that neutron has created on the host. Normally, one will invoke nova boot api with the —nic options to specify the nic with which the instance will be connected to the network. It currently allows net-id, fixed ip and/or port-id to be specified for the option. However, it doesn't
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, sounds like there are enough interests for an IRC meeting before the summit. Do you guys know how to schedule a #openstack IRC meeting? thanks, Robert On 10/29/13 6:17 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: · it can be done on port creation. · It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. · It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; chris.frie...@windriver.commailto:chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have a IRC meeting? I’d prefer to IRC meeting because it’s more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with –nic option specify a network id, and then how nova device the requirement to the PCI device? I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.commailto:chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. 
--Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto the networking setup that neutron has created on the host. Normally, one will invoke nova boot api with the —nic options to specify the nic with which the instance will be connected to the network. It currently allows net-id, fixed ip and/or port-id to be specified for the option. However, it doesn't allow one to specify special networking requirements for the instance. Thanks to the nova pci-passthrough work, one can specify PCI passthrough device(s) in the nova flavor. But it doesn't provide means to tie up these PCI devices in the case of ethernet adpators with networking services. Therefore the idea is actually simple as indicated by the blueprint titles, to provide means to tie up SRIOV devices with neutron services. A work flow would roughly look like this for 'nova boot': -- Specifies networking requirements in the —nic option. Specifically for SRIOV, allow the following
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Yunhong, I haven't looked at Mellanox in much detail. I think that we'll get more details from Irena down the road. Regarding your question, I can only answer based on my experience with Cisco's VM-FEX. In a nutshell: -- a vNIC is connected to an external switch. Once the host is booted up, all the PFs and VFs provisioned on the vNIC will be created, as well as all the corresponding ethernet interfaces. -- As far as Neutron is concerned, a neutron port can be associated with a VF. One way to do so is to specify this requirement in the --nic option, providing information such as: . PCI alias (this is the same alias as defined in your nova blueprints) . direct pci-passthrough/macvtap . port profileid that is compliant with 802.1Qbh -- similar to how you translate the nova flavor with PCI requirements into PCI requests for scheduling purposes, Nova API (the nova api component) can translate the above into PCI requests for scheduling purposes. I can give more detail later on this. Regarding your last question, since the vNIC is already connected with the external switch, the vNIC driver will be responsible for communicating the port profile to the external switch. As you already know, libvirt provides several ways to specify a VM to be booted up with SRIOV. For example, in the following interface definition:

    <interface type='hostdev' managed='yes'>
      <source>
        <address type='pci' domain='0' bus='0x09' slot='0x0' function='0x01'/>
      </source>
      <mac address='01:23:45:67:89:ab'/>
      <virtualport type='802.1Qbh'>
        <parameters profileid='my-port-profile'/>
      </virtualport>
    </interface>

the SRIOV VF (bus 0x09, VF 0x01) will be allocated, and the port profile 'my-port-profile' will be used to provision this VF. Libvirt will be responsible for invoking the vNIC driver to configure this VF with the port profile my-port-profile. The driver will talk to the external switch using the 802.1Qbh standards to complete the VF's configuration and binding with the VM. Now that nova PCI passthrough is responsible for discovering/scheduling/allocating a VF, the rest of the puzzle is to associate this PCI device with the feature that's going to use it, and the feature will be responsible for configuring it. You can also see from the above example that, in one implementation of SRIOV, the feature (in this case neutron) may not need to do much in terms of working with the external switch; the work is actually done by libvirt behind the scenes. Now the questions are: -- how the port profile gets defined/managed -- how the port profile gets associated with a neutron network The first question will be specific to the particular product, and therefore a particular neutron plugin has to manage that. There may be several approaches to address the second question. For example, in the simplest case, a port profile can be associated with a neutron network. This has some significant drawbacks. Since the port profile defines features for all the ports that use it, the one port profile to one neutron network mapping would mean all the ports on the network will have exactly the same features (for example, QoS characteristics). To make it flexible, the binding of a port profile to a port may be done at port creation time. Let me know if the above answered your question. thanks, Robert On 10/29/13 3:03 AM, Jiang, Yunhong yunhong.ji...@intel.com wrote: Robert, is it possible to have a IRC meeting? I'd prefer to IRC meeting because it's more openstack style and also can keep the minutes clearly.
To your flow, can you give more detailed example. For example, I can consider user specify the instance with –nic option specify a network id, and then how nova device the requirement to the PCI device? I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.commailto:chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end
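As a rough illustration of the translation step Robert describes, a --nic entry carrying SR-IOV attributes could be turned into a PCI request along the following lines. This is only a sketch; the option names and dictionary layout are invented for illustration and are not the actual Nova code.

# Hypothetical sketch: turn the SR-IOV attributes of a --nic option into a
# PCI request that the scheduler can match against reported PCI stats.
def nic_option_to_pci_request(nic):
    if 'pci-alias' not in nic:
        return None                      # ordinary virtio vNIC, no PCI device needed
    return {
        'alias': nic['pci-alias'],       # matched against the configured PCI alias
        'count': 1,                      # one VF per vNIC
        # carried along so the VIF driver / plugin can later build the
        # 802.1Qbh virtualport element for this VF:
        'vnic_type': nic.get('vnic-type', 'direct'),
        'profileid': nic.get('profileid'),
    }

nics = [{'net-id': 'net-uuid-1', 'pci-alias': 'vmfex-vf',
         'vnic-type': 'direct', 'profileid': 'my-port-profile'}]
print([r for r in map(nic_option_to_pci_request, nics) if r])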
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi John, Great to hear from you on Cinder with pcipassthrough. I thought that it would be coming. I like the idea. thanks, Robert On 10/29/13 6:46 AM, John Garbutt j...@johngarbutt.com wrote: I would love to see a symmetry between Cinder local volumes and Neutron PCI passthrough VIFs. Not entirely sure I have that clear in my head right now, but I just wanted to share the idea: * describe resource external to nova that is attached to VM in the API (block device mapping and/or vif references) * ideally the nova scheduler needs to be aware of the local capacity, and how that relates to the above information (relates to the cross service scheduling issues) * state of the device should be stored by Neutron/Cinder (attached/detached, capacity, IP, etc), but still exposed to the scheduler * connection params get given to Nova from Neutron/Cinder * nova still has the vif driver or volume driver to make the final connection * the disk should be formatted/expanded, and network info injected in the same way as before (cloud-init, config drive, DHCP, etc) John On 29 October 2013 10:17, Irena Berezovsky ire...@mellanox.com wrote: Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: · it can be done on port creation. · It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. · It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have a IRC meeting? I¹d prefer to IRC meeting because it¹s more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with nic option specify a network id, and then how nova device the requirement to the PCI device? 
I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto the networking setup that neutron has created on the host. Normally, one will invoke nova boot api
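To make Irena's provider:physical_network mapping concrete, a minimal sketch could look like the following; the mapping table and helper names are invented for illustration, not taken from any plugin.

# Tie a neutron network to a PCI device via provider:physical_network.
# The operator maps each physical network to the PF that fronts its VFs;
# a free VF behind that PF is then what gets allocated.
PHYSNET_TO_PF = {
    'physnet1': 'eth2',   # PF whose VFs attach to physnet1
    'physnet2': 'eth3',
}

def pf_for_network(network):
    physnet = network.get('provider:physical_network')
    if physnet is None:
        raise ValueError('network is not bound to a physical network')
    return PHYSNET_TO_PF[physnet]

print(pf_for_network({'provider:physical_network': 'physnet1'}))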
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Lots of great info and discussion going on here. One additional thing I would like to mention is regarding PF and VF usage. Normally VFs will be assigned to instances, and the PF will either not be used at all, or maybe some agent in the host of the compute node might have access to the PF for something (management?). There is a neutron design track around the development of service VMs. These are dedicated instances that run neutron services like routers, firewalls, etc. It is plausible that a service VM would like to use PCI passthrough and get the entire PF. This would allow it to have complete control over a physical link, which I think will be wanted in some cases. -- Henry On Tue, Oct 29, at 10:23 am, Irena Berezovsky ire...@mellanox.com wrote: Hi, I would like to share some details regarding the support provided by Mellanox plugin. It enables networking via SRIOV pass-through devices or macvtap interfaces. It plugin is available here: https://github.com/openstack/neutron/tree/master/neutron/plugins/mlnx. To support either PCI pass-through device and macvtap interface type of vNICs, we set neutron port profile:vnic_type according to the required VIF type and then use the created port to ‘nova boot’ the VM. To overcome the missing scheduler awareness for PCI devices which was not part of the Havana release yet, we have an additional service (embedded switch Daemon) that runs on each compute node. This service manages the SRIOV resources allocation, answers vNICs discovery queries and applies VLAN/MAC configuration using standard Linux APIs (code is here: https://github.com/mellanox-openstack/mellanox-eswitchd ). The embedded switch Daemon serves as a glue layer between VIF Driver and Neutron Agent. In the Icehouse Release when SRIOV resources allocation is already part of the Nova, we plan to eliminate the need in embedded switch daemon service. So what is left to figure out is how to tie up between neutron port and PCI device and invoke networking configuration. In our case what we have is actually the Hardware VEB that is not programmed via either 802.1Qbg or 802.1Qbh, but configured locally by Neutron Agent. We also support both Ethernet and InfiniBand physical network L2 technology. This means that we apply different configuration commands to set configuration on VF. I guess what we have to figure out is how to support the generic case for the PCI device networking support, for HW VEB, 802.1Qbg and 802.1Qbh cases. BR, Irena *From:*Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Tuesday, October 29, 2013 3:31 PM *To:* Jiang, Yunhong; Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown *Cc:* OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Yunhong, I haven't looked at Mellanox in much detail. I think that we'll get more details from Irena down the road. Regarding your question, I can only answer based on my experience with Cisco's VM-FEX. In a nutshell: -- a vNIC is connected to an external switch. Once the host is booted up, all the PFs and VFs provisioned on the vNIC will be created, as well as all the corresponding ethernet interfaces . -- As far as Neutron is concerned, a neutron port can be associated with a VF. One way to do so is to specify this requirement in the —nic option, providing information such as: . PCI alias (this is the same alias as defined in your nova blueprints) . 
direct pci-passthrough/macvtap . port profileid that is compliant with 802.1Qbh -- similar to how you translate the nova flavor with PCI requirements to PCI requests for scheduling purpose, Nova API (the nova api component) can translate the above to PCI requests for scheduling purpose. I can give more detail later on this. Regarding your last question, since the vNIC is already connected with the external switch, the vNIC driver will be responsible for communicating the port profile to the external switch. As you have already known, libvirt provides several ways to specify a VM to be booted up with SRIOV. For example, in the following interface definition:

<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0' bus='0x09' slot='0x0' function='0x01'/>
  </source>
  <mac address='01:23:45:67:89:ab'/>
  <virtualport type='802.1Qbh'>
    <parameters profileid='my-port-profile'/>
  </virtualport>
</interface>

The SRIOV VF (bus 0x09, VF 0x01) will be allocated, and the port profile 'my-port-profile' will be used to provision this VF. Libvirt will be responsible for invoking the vNIC driver to configure
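For the locally configured (HW VEB) case Irena mentions, where VLAN/MAC are applied with standard Linux APIs rather than 802.1Qbg/Qbh signalling, a rough sketch of the host-side step follows. It is not the actual eswitchd code; it only shows the standard "ip link set ... vf ..." interface being driven from Python.

# Apply MAC and VLAN to an allocated VF through the PF's netdev,
# using the standard iproute2 "ip link set <PF> vf <N> mac ... vlan ..." call.
import subprocess

def configure_vf(pf_netdev, vf_index, mac, vlan):
    subprocess.check_call(['ip', 'link', 'set', pf_netdev,
                           'vf', str(vf_index),
                           'mac', mac,
                           'vlan', str(vlan)])

# e.g. configure_vf('eth2', 3, 'fa:16:3e:12:34:56', 100)
# (needs root and an SR-IOV capable NIC, so shown commented out here)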
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
* describe resource external to nova that is attached to VM in the API (block device mapping and/or vif references) * ideally the nova scheduler needs to be aware of the local capacity, and how that relates to the above information (relates to the cross service scheduling issues) I think this possibly a bit different. For volume, it's sure managed by Cinder, but for PCI devices, currently It ;s managed by nova. So we possibly need nova to translate the information (possibly before nova scheduler). * state of the device should be stored by Neutron/Cinder (attached/detached, capacity, IP, etc), but still exposed to the scheduler I'm not sure if we can keep the state of the device in Neutron. Currently nova manage all PCI devices. Thanks --jyh * connection params get given to Nova from Neutron/Cinder * nova still has the vif driver or volume driver to make the final connection * the disk should be formatted/expanded, and network info injected in the same way as before (cloud-init, config drive, DHCP, etc) John On 29 October 2013 10:17, Irena Berezovsky ire...@mellanox.com wrote: Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: * it can be done on port creation. * It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. * It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have a IRC meeting? I'd prefer to IRC meeting because it's more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with -nic option specify a network id, and then how nova device the requirement to the PCI device? 
I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging
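The third option in Irena's list (a dedicated L2 neutron agent per host) might, very roughly, look like the loop below. The discovery and apply steps are placeholders to be filled in per driver; the only external call assumed is python-neutronclient's list_ports filter, and even that is an assumption about the deployment.

import time

def agent_loop(neutron, scan_allocated_vfs, apply_config, interval=2):
    # scan_allocated_vfs(): host-specific; returns [(pf_netdev, vf_index, mac), ...]
    # apply_config(pf, vf, port): e.g. set the VLAN derived from the port's network
    while True:
        for pf, vf, mac in scan_allocated_vfs():
            ports = neutron.list_ports(mac_address=mac).get('ports', [])
            if ports:
                apply_config(pf, vf, ports[0])
        time.sleep(interval)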
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Your explanation of the virtual network and physical network is quite clear and should work well. We need to change nova code to achieve it, including getting the physical network for the virtual network, passing the physical network requirement to the filter properties, etc. For your port method, do you mean that we are sure to pass a network id to 'nova boot' and nova will create the port during VM boot, am I right? Also, how does nova know that it needs to allocate a PCI device for the port? I'd suppose that in an SR-IOV NIC environment, the user doesn't need to specify the PCI requirement. Instead, the PCI requirement should come from the network configuration and image property. Or do you think the user still needs to pass a flavor with a PCI request? --jyh From: Irena Berezovsky [mailto:ire...@mellanox.com] Sent: Tuesday, October 29, 2013 3:17 AM To: Jiang, Yunhong; Robert Li (baoli); prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. The question is when and how to apply networking configuration on the PCI device? We see the following options: * it can be done on port creation. * It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. * It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support Robert, is it possible to have an IRC meeting? I'd prefer an IRC meeting because it's more openstack style and also can keep the minutes clearly. To your flow, can you give more detailed example. For example, I can consider user specify the instance with -nic option specify a network id, and then how nova device the requirement to the PCI device? 
I assume the network id should define the switches that the device can connect to , but how is that information translated to the PCI property requirement? Will this translation happen before the nova scheduler make host decision? Thanks --jyh From: Robert Li (baoli) [mailto:ba...@cisco.com] Sent: Monday, October 28, 2013 12:22 PM To: Irena Berezovsky; prashant.upadhy...@aricent.commailto:prashant.upadhy...@aricent.com; Jiang, Yunhong; chris.frie...@windriver.commailto:chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, Thank you very much for your comments. See inline. --Robert On 10/27/13 3:48 AM, Irena Berezovsky ire...@mellanox.commailto:ire...@mellanox.com wrote: Hi Robert, Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end to end flow? How do you suggest to bind Nova and Neutron? The end to end flow is actually encompassed in the blueprints in a nutshell. I will reiterate it in below. The binding between Nova and Neutron occurs with the neutron v2 API that nova invokes in order to provision the neutron services. The vif driver is responsible for plugging in an instance onto
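A sketch of what Yunhong suggests here, deriving the PCI requirement from the network plus an image property rather than a flavor extra_spec. Every name below, including the image property key, is made up for illustration and is not an existing API.

def pci_request_from_port(network, image_props):
    vnic_type = image_props.get('vnic_type', 'virtio')
    if vnic_type not in ('direct', 'macvtap'):
        return None                       # plain virtio vNIC, nothing to allocate
    return {
        'count': 1,
        'spec': [{'physical_network': network['provider:physical_network']}],
        'vnic_type': vnic_type,
    }

print(pci_request_from_port({'provider:physical_network': 'physnet1'},
                            {'vnic_type': 'direct'}))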
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Henry,why do you think the service VM need the entire PF instead of a VF? I think the SR-IOV NIC should provide QoS and performance isolation. As to assign entire PCI device to a guest, that should be ok since usually PF and VF has different device ID, the tricky thing is, at least for some PCI devices, you can't configure that some NIC will have SR-IOV enabled while others not. Thanks --jyh -Original Message- From: Henry Gessau [mailto:ges...@cisco.com] Sent: Tuesday, October 29, 2013 8:10 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Lots of great info and discussion going on here. One additional thing I would like to mention is regarding PF and VF usage. Normally VFs will be assigned to instances, and the PF will either not be used at all, or maybe some agent in the host of the compute node might have access to the PF for something (management?). There is a neutron design track around the development of service VMs. These are dedicated instances that run neutron services like routers, firewalls, etc. It is plausible that a service VM would like to use PCI passthrough and get the entire PF. This would allow it to have complete control over a physical link, which I think will be wanted in some cases. -- Henry On Tue, Oct 29, at 10:23 am, Irena Berezovsky ire...@mellanox.com wrote: Hi, I would like to share some details regarding the support provided by Mellanox plugin. It enables networking via SRIOV pass-through devices or macvtap interfaces. It plugin is available here: https://github.com/openstack/neutron/tree/master/neutron/plugins/mln x. To support either PCI pass-through device and macvtap interface type of vNICs, we set neutron port profile:vnic_type according to the required VIF type and then use the created port to 'nova boot' the VM. To overcome the missing scheduler awareness for PCI devices which was not part of the Havana release yet, we have an additional service (embedded switch Daemon) that runs on each compute node. This service manages the SRIOV resources allocation, answers vNICs discovery queries and applies VLAN/MAC configuration using standard Linux APIs (code is here: https://github.com/mellanox-openstack/mellanox-eswitchd ). The embedded switch Daemon serves as a glue layer between VIF Driver and Neutron Agent. In the Icehouse Release when SRIOV resources allocation is already part of the Nova, we plan to eliminate the need in embedded switch daemon service. So what is left to figure out is how to tie up between neutron port and PCI device and invoke networking configuration. In our case what we have is actually the Hardware VEB that is not programmed via either 802.1Qbg or 802.1Qbh, but configured locally by Neutron Agent. We also support both Ethernet and InfiniBand physical network L2 technology. This means that we apply different configuration commands to set configuration on VF. I guess what we have to figure out is how to support the generic case for the PCI device networking support, for HW VEB, 802.1Qbg and 802.1Qbh cases. 
BR, Irena *From:*Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Tuesday, October 29, 2013 3:31 PM *To:* Jiang, Yunhong; Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown *Cc:* OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Yunhong, I haven't looked at Mellanox in much detail. I think that we'll get more details from Irena down the road. Regarding your question, I can only answer based on my experience with Cisco's VM-FEX. In a nutshell: -- a vNIC is connected to an external switch. Once the host is booted up, all the PFs and VFs provisioned on the vNIC will be created, as well as all the corresponding ethernet interfaces . -- As far as Neutron is concerned, a neutron port can be associated with a VF. One way to do so is to specify this requirement in the -nic option, providing information such as: . PCI alias (this is the same alias as defined in your nova blueprints) . direct pci-passthrough/macvtap . port profileid that is compliant with 802.1Qbh -- similar to how you translate the nova flavor with PCI requirements to PCI requests for scheduling purpose, Nova API (the nova api component) can translate the above to PCI requests for scheduling purpose. I can give more detail later on this. Regarding your last question, since the vNIC is already connected with the external switch, the vNIC driver will be responsible for communicating
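Yunhong's point that a PF and its VFs usually carry different device IDs can be seen with, for example, an Intel 82576 (PF 8086:10c9, VF 8086:10ca), so a whitelist keyed on product_id can expose the VFs while leaving the PF alone. The matcher below is a small illustrative sketch, not the exact nova.conf whitelist syntax.

devices = [
    {'address': '0000:09:00.0', 'vendor_id': '8086', 'product_id': '10c9'},  # PF
    {'address': '0000:09:10.0', 'vendor_id': '8086', 'product_id': '10ca'},  # VF
    {'address': '0000:09:10.2', 'vendor_id': '8086', 'product_id': '10ca'},  # VF
]
whitelist = [{'vendor_id': '8086', 'product_id': '10ca'}]   # VFs only

def whitelisted(dev):
    return any(all(dev.get(k) == v for k, v in entry.items())
               for entry in whitelist)

print([d['address'] for d in devices if whitelisted(d)])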
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On Tue, Oct 29, at 4:31 pm, Jiang, Yunhong yunhong.ji...@intel.com wrote: Henry,why do you think the service VM need the entire PF instead of a VF? I think the SR-IOV NIC should provide QoS and performance isolation. I was speculating. I just thought it might be a good idea to leave open the possibility of assigning a PF to a VM if the need arises. Neutron service VMs are a new thing. I will be following the discussions and there is a summit session for them. It remains to be seen if there is any desire/need for full PF ownership of NICs. But if a service VM owns the PF and has the right NIC driver it could do some advanced features with it. As to assign entire PCI device to a guest, that should be ok since usually PF and VF has different device ID, the tricky thing is, at least for some PCI devices, you can't configure that some NIC will have SR-IOV enabled while others not. Thanks for the warning. :) Perhaps the cloud admin might plug in an extra NIC in just a few nodes (one or two per rack, maybe) for the purpose of running service VMs there. Again, just speculating. I don't know how hard it is to manage non-homogenous nodes. Thanks --jyh -Original Message- From: Henry Gessau [mailto:ges...@cisco.com] Sent: Tuesday, October 29, 2013 8:10 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Lots of great info and discussion going on here. One additional thing I would like to mention is regarding PF and VF usage. Normally VFs will be assigned to instances, and the PF will either not be used at all, or maybe some agent in the host of the compute node might have access to the PF for something (management?). There is a neutron design track around the development of service VMs. These are dedicated instances that run neutron services like routers, firewalls, etc. It is plausible that a service VM would like to use PCI passthrough and get the entire PF. This would allow it to have complete control over a physical link, which I think will be wanted in some cases. -- Henry On Tue, Oct 29, at 10:23 am, Irena Berezovsky ire...@mellanox.com wrote: Hi, I would like to share some details regarding the support provided by Mellanox plugin. It enables networking via SRIOV pass-through devices or macvtap interfaces. It plugin is available here: https://github.com/openstack/neutron/tree/master/neutron/plugins/mln x. To support either PCI pass-through device and macvtap interface type of vNICs, we set neutron port profile:vnic_type according to the required VIF type and then use the created port to 'nova boot' the VM. To overcome the missing scheduler awareness for PCI devices which was not part of the Havana release yet, we have an additional service (embedded switch Daemon) that runs on each compute node. This service manages the SRIOV resources allocation, answers vNICs discovery queries and applies VLAN/MAC configuration using standard Linux APIs (code is here: https://github.com/mellanox-openstack/mellanox-eswitchd ). The embedded switch Daemon serves as a glue layer between VIF Driver and Neutron Agent. In the Icehouse Release when SRIOV resources allocation is already part of the Nova, we plan to eliminate the need in embedded switch daemon service. So what is left to figure out is how to tie up between neutron port and PCI device and invoke networking configuration. 
In our case what we have is actually the Hardware VEB that is not programmed via either 802.1Qbg or 802.1Qbh, but configured locally by Neutron Agent. We also support both Ethernet and InfiniBand physical network L2 technology. This means that we apply different configuration commands to set configuration on VF. I guess what we have to figure out is how to support the generic case for the PCI device networking support, for HW VEB, 802.1Qbg and 802.1Qbh cases. BR, Irena *From:*Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Tuesday, October 29, 2013 3:31 PM *To:* Jiang, Yunhong; Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown *Cc:* OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Yunhong, I haven't looked at Mellanox in much detail. I think that we'll get more details from Irena down the road. Regarding your question, I can only answer based on my experience with Cisco's VM-FEX. In a nutshell: -- a vNIC is connected to an external switch. Once the host is booted up, all the PFs and VFs provisioned on the vNIC will be created, as well as all the corresponding ethernet interfaces . -- As far as Neutron is concerned, a neutron port
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
-Original Message- From: Henry Gessau [mailto:ges...@cisco.com] Sent: Tuesday, October 29, 2013 2:23 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support On Tue, Oct 29, at 4:31 pm, Jiang, Yunhong yunhong.ji...@intel.com wrote: Henry, why do you think the service VM need the entire PF instead of a VF? I think the SR-IOV NIC should provide QoS and performance isolation. I was speculating. I just thought it might be a good idea to leave open the possibility of assigning a PF to a VM if the need arises. Neutron service VMs are a new thing. I will be following the discussions and there is a summit session for them. It remains to be seen if there is any desire/need for full PF ownership of NICs. But if a service VM owns the PF and has the right NIC driver it could do some advanced features with it. At least in the current PCI implementation, if a device does not have SR-IOV enabled, then that device will be exposed and can be assigned (is this your so-called PF?). If a device has SR-IOV enabled, then only the VFs are exposed and the PF is hidden from the resource tracker. The reason is that when SR-IOV is enabled, the PF is mostly used to configure and manage the VFs, and it would be a security issue to expose the PF to a guest. When you talk about the PF, I'm not sure whether you mean the PF with or without SR-IOV enabled. I totally agree that assigning a PCI NIC to a service VM has a lot of benefit from both the performance and isolation points of view. Thanks --jyh ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
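The behaviour Yunhong describes, expose a plain device or a VF and hide a PF once its VFs are enabled, boils down to a check like the one below. This is a sketch of the idea only, not the actual nova resource tracker code, and the field names are invented.

def assignable(dev):
    # dev: {'is_vf': bool, 'numvfs': int}
    if dev['is_vf']:
        return True                  # VFs are what instances get
    return dev['numvfs'] == 0        # whole device only if SR-IOV is not enabled

print(assignable({'is_vf': False, 'numvfs': 0}))   # True  - plain NIC, can be assigned
print(assignable({'is_vf': False, 'numvfs': 7}))   # False - PF is hidden, it manages the VFs
print(assignable({'is_vf': True,  'numvfs': 0}))   # True  - a VF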
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
On Tue, Oct 29, at 5:52 pm, Jiang, Yunhong yunhong.ji...@intel.com wrote: -Original Message- From: Henry Gessau [mailto:ges...@cisco.com] Sent: Tuesday, October 29, 2013 2:23 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support On Tue, Oct 29, at 4:31 pm, Jiang, Yunhong yunhong.ji...@intel.com wrote: Henry,why do you think the service VM need the entire PF instead of a VF? I think the SR-IOV NIC should provide QoS and performance isolation. I was speculating. I just thought it might be a good idea to leave open the possibility of assigning a PF to a VM if the need arises. Neutron service VMs are a new thing. I will be following the discussions and there is a summit session for them. It remains to be seen if there is any desire/need for full PF ownership of NICs. But if a service VM owns the PF and has the right NIC driver it could do some advanced features with it. At least in current PCI implementation, if a device has no SR-IOV enabled, then that device will be exposed and can be assigned (is this your so-called PF?). Apologies, this was not clear to me until now. Thanks. I am not aware of a use-case for a service VM needing to control VFs. So you are right, I should not have talked about PF but rather just the entire NIC device in passthrough mode, no SR-IOV needed. So the admin will need to know: Put a NIC in SR-IOV mode if it is to be used by multiple VMs. Put a NIC in single device passthrough mode if it is to be used by one service VM. If a device has SR-IOV enabled, then only VF be exposed and the PF is hidden from resource tracker. The reason is, when SR-IOV enabled, the PF is mostly used to configure and management the VFs, and it will be security issue to expose the PF to a guest. Thanks for bringing up the security issue. If a physical network interface is connected in a special way to some switch/router with the intention being for it to be used only by a service VM, then close attention must be paid to security. The device owner might get some low-level network access that can be misused. I'm not sure if you are talking about the PF, are you talking about the PF w/ or w/o SR-IOV enabled. I totally agree that assign a PCI NIC to service VM have a lot of benefit from both performance and isolation point of view. Thanks --jyh ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
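Henry's two admin choices map onto how many VFs the NIC is told to create. On reasonably recent kernels that is a sysfs knob; the sketch below assumes that interface is present, and it needs root and SR-IOV capable hardware.

def set_numvfs(pf_netdev, count):
    # write the desired VF count to the standard sysfs attribute for the PF
    path = '/sys/class/net/%s/device/sriov_numvfs' % pf_netdev
    with open(path, 'w') as f:
        f.write(str(count))

# set_numvfs('eth2', 7)   # SR-IOV mode: VFs shared among multiple VMs
# set_numvfs('eth2', 0)   # plain mode: whole NIC left for one service VM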
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
-Original Message- From: Isaku Yamahata [mailto:isaku.yamah...@gmail.com] Sent: Tuesday, October 29, 2013 8:24 PM To: OpenStack Development Mailing List (not for usage questions) Cc: isaku.yamah...@gmail.com; Itzik Brown Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Yunhong. On Tue, Oct 29, 2013 at 08:22:40PM +0000, Jiang, Yunhong yunhong.ji...@intel.com wrote: * describe resource external to nova that is attached to VM in the API (block device mapping and/or vif references) * ideally the nova scheduler needs to be aware of the local capacity, and how that relates to the above information (relates to the cross service scheduling issues) I think this possibly a bit different. For volume, it's sure managed by Cinder, but for PCI devices, currently it's managed by nova. So we possibly need nova to translate the information (possibly before nova scheduler). * state of the device should be stored by Neutron/Cinder (attached/detached, capacity, IP, etc), but still exposed to the scheduler I'm not sure if we can keep the state of the device in Neutron. Currently nova manage all PCI devices. Yes, with the current implementation, nova manages PCI devices and it works. That's great. It will remain so in the Icehouse cycle (maybe also J?). But how about the long term direction? Neutron should know/manage such network related resources on compute nodes? So you mean the PCI device management will be split between Nova and Neutron? For example, non-NIC devices owned by nova and NIC devices owned by neutron? There have been so many discussions of the scheduler enhancement, like https://etherpad.openstack.org/p/grizzly-split-out-scheduling , so possibly that's the right direction? Let's wait for the summit discussion. The implementation in Nova will be moved into Neutron like what Cinder did? Any opinions/thoughts? It seems that not so many Neutron developers are interested in PCI passthrough at the moment, though. There are use cases for this, I think. For example, some compute nodes use the OVS plugin, other nodes the LB plugin. (Right now it may not be possible easily, but it will be with the ML2 plugin and mechanism drivers.) Users want their VMs to run on nodes with the OVS plugin for some reason (e.g. performance difference). Such usage would be handled similarly. Thanks, --- Isaku Yamahata Thanks --jyh * connection params get given to Nova from Neutron/Cinder * nova still has the vif driver or volume driver to make the final connection * the disk should be formatted/expanded, and network info injected in the same way as before (cloud-init, config drive, DHCP, etc) John On 29 October 2013 10:17, Irena Berezovsky ire...@mellanox.com wrote: Hi Jiang, Robert, IRC meeting option works for me. If I understand your question below, you are looking for a way to tie up between requested virtual network(s) and requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to the provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network. Another topic to discuss is where the mapping between neutron port and PCI device should be managed. One way to solve it, is to propagate the allocated PCI device details to neutron on port creation. In case there is no qbg/qbh support, VF networking configuration should be applied locally on the Host. 
The question is when and how to apply networking configuration on the PCI device? We see the following options: * it can be done on port creation. * It can be done when nova VIF driver is called for vNIC plugging. This will require to have all networking configuration available to the VIF driver or send request to the neutron server to obtain it. * It can be done by having a dedicated L2 neutron agent on each Host that scans for allocated PCI devices and then retrieves networking configuration from the server and configures the device. The agent will be also responsible for managing update requests coming from the neutron server. For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent. BR, Irena From: Jiang, Yunhong [mailto:yunhong.ji...@intel.com] Sent: Tuesday, October 29, 2013 9:04 AM To: Robert Li (baoli); Irena Berezovsky; prashant.upadhy...@aricent.com; chris.frie...@windriver.com; He, Yongli; Itzik Brown Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
is planned to be discussed during the summit: http://summit.openstack.org/cfp/details/129. I think it’s worth to drill down into more detailed proposal and present it during the summit, especially since it impacts both nova and neutron projects. I agree. Maybe we can steal some time in that discussion. Would you be interested in collaboration on this effort? Would you be interested to exchange more emails or set an IRC/WebEx meeting during this week before the summit? Sure. If folks want to discuss it before the summit, we can schedule a webex later this week. Or otherwise, we can continue the discussion with email. Regards, Irena *From:*Robert Li (baoli) [mailto:ba...@cisco.com] *Sent:* Friday, October 25, 2013 11:16 PM *To:* prashant.upadhy...@aricent.com mailto:prashant.upadhy...@aricent.com; Irena Berezovsky; yunhong.ji...@intel.com mailto:yunhong.ji...@intel.com; chris.frie...@windriver.com mailto:chris.frie...@windriver.com; yongli...@intel.com mailto:yongli...@intel.com *Cc:* OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu) *Subject:* Re: [openstack-dev] [nova] [neutron] PCI pass-through network support Hi Irena, This is Robert Li from Cisco Systems. Recently, I was tasked to investigate such support for Cisco's systems that support VM-FEX, which is a SRIOV technology supporting 802-1Qbh. I was able to bring up nova instances with SRIOV interfaces, and establish networking in between the instances that employes the SRIOV interfaces. Certainly, this was accomplished with hacking and some manual intervention. Based on this experience and my study with the two existing nova pci-passthrough blueprints that have been implemented and committed into Havana (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base and https://blueprints.launchpad.net/nova/+spec/pci-passthrough-libvirt), I registered a couple of blueprints (one on Nova side, the other on the Neutron side): https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov https://blueprints.launchpad.net/neutron/+spec/pci-passthrough-sriov in order to address SRIOV support in openstack. Please take a look at them and see if they make sense, and let me know any comments and questions. We can also discuss this in the summit, I suppose. I noticed that there is another thread on this topic, so copy those folks from that thread as well. thanks, Robert On 10/16/13 4:32 PM, Irena Berezovsky ire...@mellanox.com mailto:ire...@mellanox.com wrote: Hi, As one of the next steps for PCI pass-through I would like to discuss is the support for PCI pass-through vNIC. While nova takes care of PCI pass-through device resources management and VIF settings, neutron should manage their networking configuration. I would like to register asummit proposal to discuss the support for PCI pass-through networking. I am not sure what would be the right topic to discuss the PCI pass-through networking, since it involve both nova and neutron. There is already a session registered by Yongli on nova topic to discuss the PCI pass-through next steps. I think PCI pass-through networking is quite a big topic and it worth to have a separate discussion. Is there any other people who are interested to discuss it and share their thoughts and experience? Regards, Irena ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Hi Irena, This is Robert Li from Cisco Systems. Recently, I was tasked to investigate such support for Cisco's systems that support VM-FEX, which is an SRIOV technology supporting 802.1Qbh. I was able to bring up nova instances with SRIOV interfaces, and establish networking in between the instances that employ the SRIOV interfaces. Certainly, this was accomplished with hacking and some manual intervention. Based on this experience and my study of the two existing nova pci-passthrough blueprints that have been implemented and committed into Havana (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base and https://blueprints.launchpad.net/nova/+spec/pci-passthrough-libvirt), I registered a couple of blueprints (one on the Nova side, the other on the Neutron side): https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov https://blueprints.launchpad.net/neutron/+spec/pci-passthrough-sriov in order to address SRIOV support in openstack. Please take a look at them and see if they make sense, and let me know any comments and questions. We can also discuss this at the summit, I suppose. I noticed that there is another thread on this topic, so I'm copying those folks from that thread as well. thanks, Robert On 10/16/13 4:32 PM, Irena Berezovsky ire...@mellanox.com wrote: Hi, As one of the next steps for PCI pass-through I would like to discuss is the support for PCI pass-through vNIC. While nova takes care of PCI pass-through device resources management and VIF settings, neutron should manage their networking configuration. I would like to register a summit proposal to discuss the support for PCI pass-through networking. I am not sure what would be the right topic to discuss the PCI pass-through networking, since it involve both nova and neutron. There is already a session registered by Yongli on nova topic to discuss the PCI pass-through next steps. I think PCI pass-through networking is quite a big topic and it worth to have a separate discussion. Is there any other people who are interested to discuss it and share their thoughts and experience? Regards, Irena ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [nova] [neutron] PCI pass-through network support
Hi, One of the next steps for PCI pass-through that I would like to discuss is the support for PCI pass-through vNICs. While nova takes care of PCI pass-through device resources management and VIF settings, neutron should manage their networking configuration. I would like to register a summit proposal to discuss the support for PCI pass-through networking. I am not sure what would be the right topic to discuss the PCI pass-through networking under, since it involves both nova and neutron. There is already a session registered by Yongli on the nova topic to discuss the PCI pass-through next steps. I think PCI pass-through networking is quite a big topic and it is worth having a separate discussion. Are there any other people who are interested in discussing it and sharing their thoughts and experience? Regards, Irena ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev