Jim,

James Carlson wrote:
> Nicolas Droux writes:
>> From the administration interface point of view, there are two ways to
>> associate properties with data-links. For data-links that are created
>> through a dladm subcommand like create-vnic, the initial set of
>> properties can be specified during the creation of the data-link itself
>> through a dedicated option. In addition, the properties can be set on
>> any data-link through the set-linkprop subcommand. The former allows the
>> administrator to create a VNIC with bandwidth control in a single
>> command instead of having to go through a two-step dance.
>
> Does this mean that the same properties will be accessible via both
> "modify-vnic" and "set-linkprop"?
>
> I can understand wanting to set some initial properties at create
> time, but it seems odd that the new general properties are segregated
> into VNIC-specific commands.

No, only set-linkprop will be used to change these properties, not
modify-vnic. We'll send out updated man pages to reflect these changes,
and they will be different from the man pages that were published as
part of our current bits.

>>> - Do bandwidth and CPU controls rely on squeues? If so, then VNICs
>>> may not be able to control utilization from non-IP traffic, such
>>> as with bridging.
>> There is a level of bandwidth control done by the squeue, but there's
>> also bandwidth control done by the MAC layer itself, which is useful when
>> there's a need to do bandwidth control before fanout to multiple CPUs at
>> the MAC layer, and also for non-IP protocols, or when the MAC is being
>> used by a virtual machine's back-end drivers in the host OS. See also
>> Sunay's writeup at
>> http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
>> for more details on this topic.
>
> I had found and read that document before writing my comment.
>
> I still don't quite see the relationship here. What are the
> responsibilities of the two mechanisms (the mac layer and the
> squeues)?
>
> To put the question in another way: suppose I have a non-IP protocol
> using a VNIC with a bandwidth control set on it. What happens? Are
> there features that were related to squeues that I won't be able to
> use? If so, then what are those features?

The client will see a MAC which has a bandwidth limit; nothing else is
required.

> Or, to put it another way still: are there things that non-IP
> protocols should or could be doing in order to "cooperate" with this
> bandwidth control so that they behave as well as IP's squeues will?

No, there are no special requirements. The bandwidth limits set on a MAC
will be enforced by the MAC layer SRS. We'll also have a flow API which
will be available to MAC clients to define bandwidth limits for
services, etc., and it will be used by clients like IP when needed.

>> I was trying to allow the system administrator to minimize the impact on
>> the existing MAC address assignment when moving a VNIC off of and back
>> to a device. But I agree that it's not optimal. If the folks on
>> this list feel that the MAC address changing is not an issue, I've no
>> problem using the simpler scheme of reassigning a new MAC address to the
>> VNIC/MAC client.
>
> If I (as a system administrator) say "factory" as part of the
> configuration of the interface, then I'd expect to get a factory-
> supplied address. My expectation would be that when the factory-
> supplied components are swapped out underneath, the address changes.

Actually, there are three sub-cases to this, I think:

1. The administrator does not specify an address (automatic assignment),
   and a factory MAC address is assigned to the VNIC. In this case, I
   think it's fine to assign a different MAC address, e.g. a random one,
   to the VNIC if the VNIC is moved to a NIC which does not have an
   available factory MAC address.

2. The administrator requested a factory MAC address explicitly. In this
   case the VNIC could be moved to a different NIC which has an available
   factory MAC address; otherwise the operation would fail unless a force
   flag is set.

3. The administrator requested a factory MAC address from a specific
   slot, so there's a clear intent of using a specific MAC address of the
   device underneath. In that case the move operation would fail unless a
   force flag is set.

A rough sketch of this policy follows below.
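
To make the three cases concrete, here is the decision logic sketched
out. The type and helper names below are made up purely for
illustration; only the three cases themselves come from the proposal,
and the actual implementation may end up looking quite different.

#include <errno.h>
#include <stdbool.h>

/*
 * Illustration only: how a VNIC's MAC address could be handled when the
 * VNIC is moved to a different NIC. All names here are hypothetical.
 */
typedef enum {
	VNIC_ADDR_AUTO,		/* case 1: no address specified by the admin */
	VNIC_ADDR_FACTORY,	/* case 2: a factory address was requested */
	VNIC_ADDR_FACTORY_SLOT	/* case 3: factory address from a given slot */
} vnic_addr_req_t;

typedef struct nic nic_t;	/* opaque placeholder for the target NIC */

extern bool nic_has_free_factory_addr(nic_t *);
extern int vnic_use_factory_addr(nic_t *);
extern int vnic_use_random_addr(void);

int
vnic_move_addr(vnic_addr_req_t req, nic_t *dst, bool force)
{
	switch (req) {
	case VNIC_ADDR_AUTO:
		/* Case 1: prefer a factory address, else fall back to random. */
		if (nic_has_free_factory_addr(dst))
			return vnic_use_factory_addr(dst);
		return vnic_use_random_addr();
	case VNIC_ADDR_FACTORY:
		/* Case 2: need a free factory address on the target NIC. */
		if (nic_has_free_factory_addr(dst))
			return vnic_use_factory_addr(dst);
		return force ? vnic_use_random_addr() : EBUSY;
	case VNIC_ADDR_FACTORY_SLOT:
		/* Case 3: a specific slot was requested; fail unless forced. */
		return force ? vnic_use_random_addr() : EBUSY;
	}
	return EINVAL;
}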

> Having the factory-supplied address come unmoored from the device
> itself seems odd to me, and almost certain to cause trouble. I
> suppose it could be possible to create an "adopt the factory address
> and treat it as though it were my own statically-configured address"
> option, but I'd certainly want to see it come with adequate warnings
> about the dangers and a clear user interface (not "factory" but
> "steal-from-factory" ;-}). I'm not sure that it'd be administratively
> interesting, though.

Yes, if that option were chosen and the source NIC ends up being
recycled later, there's a risk of duplicate addresses; that's less than
ideal.

>>> - What happens if a NIC is oversubscribed by the amount of bandwidth
>>> configured for the VNICs? Is the result proportionate (and thus
>>> "fair") allocation, or do they compete on some other grounds?
>> In that case it will depend on other factors such as the type of
>> traffic, the CPU(s) processing that traffic, etc.
>
> I suggest putting more effort into characterizing this, because
> oversubscribing is a common and fairly well understood way to balance
> risk versus utilization and occurs often in handling failure scenarios
> (such as with aggregation).
>
> I've seen similar schemes for access servers (most have proprietary
> RADIUS extensions for setting bandwidth limits), and the usual way
> this works is that once the link is saturated, the configured limits
> become shares. Thus, the clients are all hurt in proportion to the
> amount of bandwidth they're given.

The limits are really used to clamp down on bandwidth utilization by a
MAC, but they do not imply any guaranteed bandwidth. As a future
deliverable, we're also planning to provide bandwidth guarantees, which
is what you seem to be referring to here.

>>> What kind of bandwidth control exists here? How granular is it,
>>> and what effects do clients see from restricted bandwidth? Are
>>> packets dropped (they have to be, if bandwidth limits apply to
>>> forwarded traffic)? If so, is it tail drop or something more
>>> sophisticated?
>> In general, if an SRS or flow is assigned its own hardware ring, then the
>> polling thread will poll packets directly from the ring, and there's no
>> dropping from the host. Packets will be polled from the rings when
>> allowed as per bandwidth limits and consumption. The polling thread is
>> scheduled every tick, and we compute a maximum number of bytes per tick.
>>
>> If more than one SRS/squeue shares a ring, there's no polling of the
>> ring. Instead, traffic will be interrupt driven, and packets will be
>> deposited on queues associated with the SRS/squeue. Packets are then
>> pulled from these queues based on bandwidth limits. If the maximum
>> number of packets in these queues is exceeded, then there's tail drop.
>> Again, see the SRS design doc.
>
> "Tail drop" looks like the answer I was looking for.
>
> In that case, you might want to consider (at least as an RFE)
> including basic RED support here. There can be a big difference in
> behavior between hardware-imposed limits (ones that presumably affect
> both the sender and receiver in most cases) and artificial limits
> because the network behavior is quite different, and tail-drop is
> known to cause poor TCP performance.

Agreed. We still need to document our existing scheme in more detail
here, and we should discuss alternatives as part of that text.
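
In the meantime, roughly speaking, the behavior described above amounts
to something like the sketch below: a per-tick byte budget derived from
the configured limit, and tail drop on the backlog queue when a ring is
shared. The structure and function names are invented for this example;
this is not the actual SRS code.

#include <stddef.h>
#include <stdint.h>

#define HZ 100	/* assumed clock ticks per second */

typedef struct pkt {
	struct pkt	*next;
	size_t		len;		/* packet length in bytes */
} pkt_t;

typedef struct srs {
	uint64_t	bw_limit_bps;	/* configured limit, bits per second */
	pkt_t		*q_head;	/* backlog used when a ring is shared */
	pkt_t		*q_tail;
	size_t		q_len;
	size_t		q_max;		/* tail drop beyond this depth */
} srs_t;

/* Maximum number of bytes that may be drained during one clock tick. */
static uint64_t
srs_bytes_per_tick(const srs_t *srs)
{
	/* e.g. a 100 Mb/s limit at 100 ticks/s allows 125,000 bytes per tick */
	return srs->bw_limit_bps / 8 / HZ;
}

/* Called once per tick: deliver queued packets up to the per-tick budget. */
void
srs_tick_drain(srs_t *srs, void (*deliver)(pkt_t *))
{
	uint64_t budget = srs_bytes_per_tick(srs);
	uint64_t used = 0;

	while (srs->q_head != NULL && used + srs->q_head->len <= budget) {
		pkt_t *p = srs->q_head;

		if ((srs->q_head = p->next) == NULL)
			srs->q_tail = NULL;
		srs->q_len--;
		used += p->len;
		deliver(p);
	}
}

/* Receive path when the ring is shared: tail drop once the queue is full. */
int
srs_enqueue(srs_t *srs, pkt_t *p)
{
	if (srs->q_len >= srs->q_max)
		return -1;		/* tail drop */
	p->next = NULL;
	if (srs->q_tail != NULL)
		srs->q_tail->next = p;
	else
		srs->q_head = p;
	srs->q_tail = p;
	srs->q_len++;
	return 0;
}

When an SRS has a dedicated hardware ring, the same per-tick budget
simply bounds how much the polling thread pulls from the ring each tick,
so no host-side drop is needed.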

>>> - Instead of adding more arguments to mac_open() to handle priority
>>> and bandwidth, I'd suggest making these separate calls. You'll
>>> need the separate call anyway to implement the "modify" mechanism.
>> Having the parameters specified in mac_open() is useful since it allows
>> these parameters to be specified when the resources are allocated to
>> the MAC client. This avoids allocating a set of default resources and
>> then immediately changing these resources through a separate modify
>> mechanism. If we can specify them through 2-3 arguments, I don't think
>> this should be an issue.
>
> I think it's much more flexible and easier to do it later.
>
> You're going to need a function to change the values after mac_open()
> time. By supplying the same values during mac_open(), you're just
> duplicating that functionality.

It might be a single piece of code which can be called from both the
open and modify functions to allocate resources according to these
parameters. I think the duplication can be avoided.

> Worse, mac_open() is a core function, while resource control is at the
> periphery. If you need to modify mac_open() every time resource
> controls are tweaked -- consider what happens when shared resources
> are introduced (allowing control of multiple interfaces as a group),
> or when more advanced queuing disciplines are allowed -- then this
> interface will never settle down and never be appropriate as a DDI
> function.
>
> Separating these two allows you to add new control functions in the
> future without having to modify every mac_open() caller.
>
> It's as though every fcntl(2) feature needed to be supplied in
> open(2).
>
> Why is the resource allocation itself an important thing to optimize
> versus the interface stability and scalability?

I don't agree with the "core function" vs. "periphery" argument.
Resource control is becoming an integral part of the MAC layer, and
there shouldn't be a need to take extra steps to enable that
functionality. But I agree with your point about designing an API which
allows more options to be added in the future without breaking backward
compatibility. However, I think this can be made to work without
requiring a separate call. I'll need to take a closer look at this.

>>> - MAC_UNICAST_AUTO seems unnecessary to me. Why not just call first
>>> with MAC_UNICAST_FACTORY and, if that fails, call again with
>>> MAC_UNICAST_RANDOM? Doing that would even have better
>>> functionality as MAC_UNICAST_AUTO seems to omit the possibility of
>>> desiring a particular factory address when available.
>> The intent was for AUTO to allow the slot to be specified. That option
>> should allow the slot number to be specified via addr_slot.
>
> The document says it must be -1.

Yes, and I need to fix the document to allow a slot number to be passed
when that MAC address type is specified.

>>> I think having MAC_UNICAST_AUTO in the mix ends up pushing some of
>>> the control-path complexity out of the user space and into the
>>> kernel. It'd be better to simplify the kernel parts.
>> This is very simple logic we're talking about here; I don't see a
>> problem doing that selection in kernel space. In addition, it avoids
>> having two system calls per VNIC created on top of NICs which do not
>> provide multiple factory MAC addresses.
>
> It's also duplicate logic. Why optimize for system call counts versus
> kernel code complexity?

There's additional code in the kernel, but that logic is very simple.
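
For comparison, the two approaches look roughly as follows from a MAC
client's point of view. The function mac_client_unicast_set() and its
signature are placeholders invented for this sketch; only the
MAC_UNICAST_* address type names come from the document.

typedef enum {
	MAC_UNICAST_FACTORY,	/* use an available factory address */
	MAC_UNICAST_RANDOM,	/* generate a random address */
	MAC_UNICAST_AUTO	/* factory if available, else random, in-kernel */
} mac_unicast_type_t;

/* Hypothetical call; the real interface may differ. A slot of -1 means "any". */
extern int mac_client_unicast_set(void *mch, mac_unicast_type_t type,
    int addr_slot);

/* Jim's suggestion: the caller tries a factory address first, then random. */
int
assign_addr_two_calls(void *mch)
{
	if (mac_client_unicast_set(mch, MAC_UNICAST_FACTORY, -1) == 0)
		return 0;
	return mac_client_unicast_set(mch, MAC_UNICAST_RANDOM, -1);
}

/* The proposal as written: one call, with the fallback done in the kernel. */
int
assign_addr_auto(void *mch)
{
	return mac_client_unicast_set(mch, MAC_UNICAST_AUTO, -1);
}

Either way the fallback logic has to live somewhere; the question is
whether it is repeated in each caller or implemented once in the kernel.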

>>> - What sorts of privileges are required to create and administer
>>> VNICs? Are these things that can be delegated to non-global
>>> zones?
>> Basically the same ones that are needed for administering other
>> data-links, i.e. sys_net_config and net_rawaccess. In a zones
>> environment, data-link administration is limited to the global zone.
>
> That latter part might not be right for IP Instances, particularly
> since VNICs can be built atop other VNICs. (Maybe that's just an
> issue for the future, though.)

Even with IP Instances, data-link control remains in the global zone.

>>> - Why is [V]NIC the right level of bandwidth control? If I want to
>>> give a zone 100Mbps worth of bandwidth, but I'm giving it multiple
>>> VNICs, how do I do that -- can the bandwidth control logic do
>>> accounting based on multiple interfaces (aggregate control, rather
>>> than individual interface control)?
>> No, the bandwidth control is on a per-interface or per-flow basis.
>> This is because the bandwidth is basically controlled by polling on a
>> per-ring (software or hardware) basis, not across a set of rings.
>
> That's quite different from what most QoS implementations I've seen
> do. The usual model is to map interfaces and flows into a "QoS
> group," which is then controlled as a single unit, as in Cisco's
> "qos-group" feature and policy maps.
>
> I'd suggest making sure that potential customers of this new bandwidth
> control feature are keenly aware of the no-resource-aggregation
> limitation. It sounds like it's intended as a fundamental design
> feature, and not something that might be a temporary feature
> limitation that could be removed later. (As a user, I wouldn't be
> surprised to find that the controls at initial release don't match
> what I actually need, but I'd be very surprised if the controls
> couldn't be fixed later.)

Yes, this will of course be fully documented. If we find an efficient
way to do bandwidth control across multiple rings in the future, I don't
see why we wouldn't be able to make use of that functionality.

Thanks,
Nicolas.

-- 
Nicolas Droux - Solaris Networking - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux