In any case, we don't have to decide this now. If we simply allowed the whitelist to add extra arbitrary properties to the PCI record (like a group name) and return them to the central server, we could, for the minute, use that in scheduling as a group name; we wouldn't implement the APIs for flavors yet, and we'd get a working system that is minimally changed from what we already have. We could worry about the scheduling in the scheduling group, and we could leave the APIs (which, as I say, are a minimally useful feature) until later. Then we'd have something useful in short order.
-- Ian.
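To make that concrete, a minimal sketch of what such a whitelist entry might look like in nova.conf on a compute node (purely illustrative: the "group" key is the hypothetical extra tag being proposed, and the vendor/product IDs are made up):

    pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "10ca", "group": "fast-nic"}
    pci_passthrough_whitelist = {"vendor_id": "10de", "product_id": "0df8", "group": "gpu"}

The compute node would then report its matching devices back tagged "fast-nic" or "gpu", and the scheduler would match on the tag alone rather than on raw device properties.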
On 10 January 2014 13:08, Ian Wells <ijw.ubu...@cack.org.uk> wrote:

> On 10 January 2014 07:40, Jiang, Yunhong <yunhong.ji...@intel.com> wrote:
>
>> Robert, sorry that I'm not a fan of the *your group* term. To me, *your
>> group* mixes two things. It's an extra property provided by configuration,
>> and it's also a very inflexible mechanism for selecting devices (you can
>> only select devices based on the 'group name' property).
>>
> It is exactly that. It's 0 new config items, 0 new APIs, just an extra
> tag on the whitelists that are already there (although the proposal
> suggests changing the name of them to be more descriptive of what they now
> do). And you talk about flexibility as if this changes frequently, but in
> fact the grouping / aliasing of devices almost never changes after
> installation, which is, not coincidentally, when the config on the compute
> nodes gets set up.
>
>> 1) A dynamic group is much better. For example, a user may want to
>> select a GPU device based on vendor_id, or based on vendor_id+device_id.
>> In other words, the user wants to create groups based on vendor_id, or on
>> vendor_id+device_id, and select devices from those groups. John's proposal
>> is very good: provide an API to create the PCI flavor (or alias). I prefer
>> flavor because it's more OpenStack style.
>>
> I disagree with this. I agree that what you're saying offers more
> flexibility after initial installation, but I have various issues with it.
>
> This is directly related to the hardware configuration on each compute
> node. For (some) other things of this nature, like provider networks, the
> compute node is the only thing that knows what it has attached to it, and
> it is the store (in configuration) of that information. If I add a new
> compute node then it's my responsibility to configure it correctly on
> attachment; and when I add a compute node (when I'm setting the cluster up,
> or sometime later on), it's at that precise point that I know how I've
> attached it and what hardware it's got on it. It's also at that point in
> time that I write out the configuration file (not by hand, note; there's
> almost certainly automation when configuring hundreds of nodes, so
> arguments that 'if I'm writing hundreds of config files one will be wrong'
> are moot).
>
> I'm also not sure there's much reason to change the available devices
> dynamically after that, since that's normally an activity that results
> from changing the physical setup of the machine, which implies that you'll
> actually have access to, and be able to change, the config as you do it.
> John did come up with one case where you might be trying to remove old
> GPUs from circulation, but it's a very uncommon case that doesn't seem
> worth coding for, and it's still achievable by changing the config and
> restarting the compute processes.
>
> This also reduces the autonomy of the compute node in favour of
> centralised tracking, which goes against the 'distributed where possible'
> philosophy of OpenStack.
>
> Finally, you're not actually removing configuration from the compute
> node. You still have to configure a whitelist there, and in the flavour
> design you have to configure the grouping (flavouring) on the control node
> as well. The groups proposal adds one extra piece of information to the
> whitelists that are already there to mark groups, not a whole new set of
> config lines.
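On the provider-network analogy in that mail: the compute node's agent config is already the only place that records which physical network is attached to which local bridge, via something like the standard OVS agent mapping (values illustrative, section/option as in the ML2 OVS agent config):

    [ovs]
    bridge_mappings = physnet1:br-eth1

Nobody asks a central API which physnets a compute node can reach; the node knows because it was configured that way when it was attached. The whitelist-plus-group tag is the same kind of locally held fact.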
>
> To compare scheduling behaviour:
>
> If I need 4G of RAM, each compute node has reported its summary of free
> RAM to the scheduler. I look for a compute node with 4G free, and filter
> the list of compute nodes down. This is a query on n records, n being the
> number of compute nodes. I schedule to the compute node, which then
> confirms it does still have 4G free and runs the VM or rejects the request.
>
> If I need 3 PCI devices and use the current system, each machine has
> reported its device allocations to the scheduler. With SRIOV multiplying
> up the number of available devices, it's reporting back hundreds of
> records per compute node to the schedulers, and the filtering activity is
> 3 queries on (n * number of PCI devices in the cloud) records, which could
> easily end up in the tens or even hundreds of thousands of records for a
> moderately sized cloud. The compute node also has a record of its device
> allocations, which is also checked and updated before the final request is
> run.
>
> If I need 3 PCI devices and use the groups system, each machine has
> reported its device *summary* to the scheduler. With SRIOV multiplying up
> the number of available devices, it's still reporting one or a small
> number of categories, e.g. { net: 100 }. The difficulty of scheduling is a
> query on (num groups * n) records - fewer, in fact, if some machines have
> no passthrough devices.
>
> You can see that there's quite a cost to be paid for having those flexible
> alias APIs.
>
>> 4) IMHO, the core of nova PCI support is **PCI property**. The property
>> means not only generic PCI device properties like vendor id, device id and
>> device type, and compute-specific properties like the BDF address or the
>> adjacent switch IP address, but also user-defined properties like
>> neutron's physical net name etc. And then it's about how to get these
>> properties, how to select/group devices based on them, and how to
>> store/fetch them.
>>
> The thing about this is that you don't always, or even often, want to
> select by property. Some of these properties are just things that you need
> to tell Neutron; they're not usually keys for scheduling.
>
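To illustrate how much data the scheduler has to sift through in each case, a rough sketch in plain Python (not actual nova code; the record shapes and field names are made up):

    from collections import Counter

    # Per-device reporting: the scheduler sees one record per PCI device / VF.
    # With SR-IOV this can be hundreds of records per host.
    def hosts_with_free_devices(device_records, group, needed):
        free = Counter(r["host"] for r in device_records
                       if r["group"] == group and not r["allocated"])
        return [host for host, count in free.items() if count >= needed]

    # Per-group summary reporting: the scheduler sees one small dict per host,
    # e.g. {"net": 100}, so the filter touches only (num hosts * num groups) values.
    def hosts_with_free_summary(summaries, group, needed):
        return [host for host, counts in summaries.items()
                if counts.get(group, 0) >= needed]

Either way the chosen compute node still re-checks and updates its own allocations before the request finally runs; the difference is purely how many records the scheduler has to filter to produce its candidate list.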