Jim,

James Carlson wrote:
> Nicolas Droux writes:
>>  From the administration interface point of view, there are two ways to 
>> associate properties with data-links. For data-links that are created 
>> through a dladm subcommand like create-vnic, the initial set of 
>> properties can be specified during the creation of the data-link itself 
>> through a dedicated option. In addition, the properties can be set on 
>> any data-link through the set-linkprop subcommand. The former allows the 
>> administrator to create a VNIC with bandwidth control in a single 
>> command instead of having to go through a two step dance.
> 
> Does this mean that the same properties will be accessible via both
> "modify-vnic" and "set-linkprop"?
> 
> I can understand wanting to set some initial properties at create
> time, but it seems odd that the new general properties are segregated
> into VNIC-specific commands.

No, only set-linkprop will be used to change these properties, not 
modify-vnic. We'll send out updated man pages to reflect these changes; 
they will be different from the man pages that were published as part 
of our current bits.

>>>   - Do bandwidth and CPU controls rely on squeues?  If so, then VNICs
>>>     may not be able to control utilization from non-IP traffic, such
>>>     as with bridging.
>> There is a level of bandwidth control done by squeues, but there's also 
>> bandwidth control done by the MAC layer itself. The latter is useful when 
>> there's a need to do bandwidth control before fanout to multiple CPUs at 
>> the MAC layer, and also for non-IP protocols, or when the MAC is being 
>> used by a virtual machine's back-end drivers in the host OS. See also 
>> Sunay's writeup at 
>> http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt 
>> for more details on this topic.
> 
> I had found and read that document before writing my comment.
> 
> I still don't quite see the relationship here.  What are the
> responsibilities of the two mechanisms (the mac layer and the
> squeues)?
> 
> To put the question in another way: suppose I have a non-IP protocol
> using a VNIC with a bandwidth control set on it.  What happens?  Are
> there features that were related to squeues that I won't be able to
> use?  If so, then what are those features?

The client will see a MAC that has a bandwidth limit; nothing else is 
required.

> Or, to put it another way still: are there things that non-IP
> protocols should or could be doing in order to "cooperate" with this
> bandwidth control so that they behave as well as IP's squeues will?

No, no special requirements. The bandwidth limits set on a MAC will be 
enforced by the MAC layer SRS. We'll also have a flow API, available to 
MAC clients, for defining bandwidth limits for services and so on; it 
will be used by clients like IP when needed.

>> I was trying to allow the system administrator to minimize the impact on 
>> the existing MAC address assignment when a VNIC is moved off of and 
>> back onto a device. But I agree that it's not optimal. If the folks on 
>> this list feel that the MAC address changing is not an issue, I've no 
>> problem using the simpler scheme of reassigning a new MAC address to the 
>> VNIC/MAC client.
> 
> If I (as a system administrator) say "factory" as part of the
> configuration of the interface, then I'd expect to get a factory-
> supplied address.  My expectation would be that when the factory-
> supplied components are swapped out underneath, the address changes.

Actually, I think there are three sub-cases here (a rough sketch of the 
decision follows the list):

1. The administrator does not specify an address (automatic assignment), 
and a factory MAC address is assigned to the VNIC. In this case, I think 
it's fine to assign a different MAC address, e.g. a random one, to the 
VNIC if it is moved to a NIC which does not have any factory MAC 
addresses available.

2. If the administrator requested a factory MAC address explicitly, 
then the VNIC could be moved to a different NIC which has an available 
factory MAC address; otherwise the operation would fail unless a force 
flag is set.

3. If the administrator requested a factory MAC address from a specific 
slot, then there's a clear intent to use a specific MAC address of the 
device underneath. In that case the move operation would fail unless a 
force flag is set.
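
For the record, here is a rough sketch of that decision in C. The type 
and helper names (vnic_addr_type_t, vnic_move_addr_check) are 
placeholders for illustration only, not the actual VNIC code:

#include <stdbool.h>

typedef enum {
        VNIC_ADDR_AUTO,         /* case 1: no address specified */
        VNIC_ADDR_FACTORY,      /* case 2: factory address requested */
        VNIC_ADDR_FACTORY_SLOT  /* case 3: factory address, given slot */
} vnic_addr_type_t;

/*
 * Decide whether a VNIC can be moved to a target NIC, and whether its
 * MAC address may be replaced in the process.  Returns 0 on success.
 */
static int
vnic_move_addr_check(vnic_addr_type_t type, bool dst_has_free_factory,
    bool force)
{
        switch (type) {
        case VNIC_ADDR_AUTO:
                /* Use a factory address if one is free, else random. */
                return (0);
        case VNIC_ADDR_FACTORY:
                /* Needs a free factory address on the target NIC. */
                return ((dst_has_free_factory || force) ? 0 : -1);
        case VNIC_ADDR_FACTORY_SLOT:
                /* Tied to a specific slot: only move when forced. */
                return (force ? 0 : -1);
        }
        return (-1);
}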

> Having the factory-supplied address come unmoored from the device
> itself seems odd to me, and almost certain to cause trouble.  I
> suppose it could be possible to create a "adopt the factory address
> and treat it as though it were my own statically-configured address"
> option, but I'd certainly want to see it come with adequate warnings
> about the dangers and a clear user interface (not "factory" but
> "steal-from-factory" ;-}).  I'm not sure that it'd be administratively
> interesting, though.

Yes, there's a risk of duplicate addresses if that option were chosen 
and the source NIC ended up being recycled later; that's less than ideal.

>>>   - What happens if a NIC is oversubscribed by the amount of bandwidth
>>>     configured for the VNICs?  Is the result proportionate (and thus
>>>     "fair") allocation, or do they compete on some other grounds?
>> In that case it will depend on other factors such as the type of 
>> traffic, the CPU(s) processing that traffic, etc.
> 
> I suggest putting more effort into characterizing this, because
> oversubscribing is a common and fairly well understood way to balance
> risk versus utilization and occurs often in handling failure scenarios
> (such as with aggregation).
> 
> I've seen similar schemes for access servers (most have proprietary
> RADIUS extensions for setting bandwidth limits), and the usual way
> this works is that once the link is saturated, the configured limits
> become shares.  Thus, the clients are all hurt in proportion to the
> amount of bandwidth they're given.

The limits are really used to clamp down on bandwidth utilization by a 
MAC, but they do not imply any guaranteed bandwidth. As a future 
deliverable we're also planning to provide bandwidth guarantees, which is 
what you seem to be referring to here.
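
To pick concrete numbers: with two VNICs each capped at 600 Mbps on a 
1 Gbps NIC, each VNIC is simply prevented from ever exceeding 600 Mbps; 
neither is promised any particular share, and when both are busy the 
actual split depends on the factors mentioned earlier (type of traffic, 
CPUs, etc.) rather than on the configured limits.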

>>>     What kind of bandwidth control exists here?  How granular is it,
>>>     and what effects do clients see from restricted bandwidth?  Are
>>>     packets dropped (they have to be, if bandwidth limits apply to
>>>     forwarded traffic)?  If so, is it tail drop or something more
>>>     sophisticated?
>> In general if a SRS or flow is assigned its own hardware ring, then the 
>> polling thread will poll packets directly from the ring, and there's no 
>> dropping from the host. Packets will be polled from the rings when 
>> allowed as per bandwidth limits and consumption. The polling thread is 
>> scheduled every tick, and we compute a maximum number of bytes per tick.
>>
>> If more than one SRS/squeue share a ring, there's no polling of the 
>> ring. Instead, traffic will be interrupt driven, and packets will be 
>> deposited on queues associated with the SRS/squeue. Packets are then 
>> pulled from these queues based on bandwidth limits. If the maximum 
>> number of packets in these queues is exceeded, then there's tail drop. 
>> Again, see the SRS design doc.
> 
> "Tail drop" looks like the answer I was looking for.
> 
> In that case, you might want to consider (at least as an RFE)
> including basic RED support here.  There can be a big difference in
> behavior between hardware-imposed limits (ones that presumably affect
> both the sender and receiver in most cases) and artificial limits
> because the network behavior is quite different, and tail-drop is
> known to cause poor TCP performance.

Agreed. We still need to document our existing scheme here in more 
detail, and we should discuss alternatives as part of that text.
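
While we write that text, here is a rough userland model of the scheme 
described above: a per-tick byte budget derived from the limit, and tail 
drop on the shared-ring path. The names, tick rate, queue limit, and the 
use of an average packet size to keep the drain loop simple are all 
illustrative; this is not the SRS code itself.

#include <stdint.h>

#define TICKS_PER_SEC   100     /* illustrative clock tick rate */
#define QUEUE_MAX_PKTS  1024    /* illustrative queue depth limit */

typedef struct {
        uint64_t bytes_per_tick;    /* budget derived from the limit */
        uint64_t bytes_left;        /* remaining budget this tick */
        uint32_t queued_pkts;       /* packets waiting in the queue */
} srs_bw_state_t;

/* Convert a configured limit (bits/sec) into a per-tick byte budget. */
static void
srs_set_limit(srs_bw_state_t *s, uint64_t maxbw_bps)
{
        s->bytes_per_tick = maxbw_bps / 8 / TICKS_PER_SEC;
}

/* Shared-ring (interrupt-driven) receive path: queue or tail-drop. */
static int
srs_enqueue(srs_bw_state_t *s)
{
        if (s->queued_pkts >= QUEUE_MAX_PKTS)
                return (-1);        /* tail drop */
        s->queued_pkts++;
        return (0);
}

/* Called once per tick: refresh the budget, then drain what it allows. */
static void
srs_tick(srs_bw_state_t *s, uint64_t avg_pkt_bytes)
{
        s->bytes_left = s->bytes_per_tick;
        while (s->queued_pkts > 0 && s->bytes_left >= avg_pkt_bytes) {
                s->queued_pkts--;
                s->bytes_left -= avg_pkt_bytes;
                /* deliver the packet to the client here */
        }
}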

>>>   - Instead of adding more arguments to mac_open() to handle priority
>>>     and bandwidth, I'd suggest making these separate calls.  You'll
>>>     need the separate call anyway to implement the "modify" mechanism.
>> Having the parameters specified in mac_open() is useful since it allows 
>>   these parameters to be specified when the resources are allocated to 
>> the MAC client. This avoids allocating a set of default resources and 
>> then immediately changing these resources through a separate modify 
>> mechanism. If we can specify them through 2-3 arguments, I don't think this 
>> should be an issue.
> 
> I think it's much more flexible and easier to do it later.
> 
> You're going to need a function to change the values after mac_open()
> time.  By supplying the same values during mac_open(), you're just
> duplicating that functionality.

It could be a single piece of code, called from both the open and modify 
functions, that allocates resources according to these parameters. I 
think the duplication can be avoided.
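
Very roughly, and with placeholder names rather than the actual 
MAC-layer code, what I have in mind is:

#include <stdint.h>

typedef struct mac_client mac_client_t;    /* opaque for this sketch */

/* Single place where resources are (re)allocated to match the values. */
static int
mac_resource_apply(mac_client_t *mcp, uint64_t maxbw, uint32_t priority)
{
        /* allocate or adjust rings/SRS state for maxbw and priority */
        (void) mcp;
        (void) maxbw;
        (void) priority;
        return (0);
}

/* Open path: resources are set up directly with the requested values. */
int
mac_client_open_sketch(mac_client_t *mcp, uint64_t maxbw, uint32_t pri)
{
        /* ... the usual open work ... */
        return (mac_resource_apply(mcp, maxbw, pri));
}

/* Modify path: the same helper reapplies the new values later on. */
int
mac_client_modify_sketch(mac_client_t *mcp, uint64_t maxbw, uint32_t pri)
{
        return (mac_resource_apply(mcp, maxbw, pri));
}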

> Worse, mac_open() is a core function, while resource control is at the
> periphery.  If you need to modify mac_open() every time resource
> controls are tweaked -- consider what happens when shared resources
> are introduced (allowing control of multiple interfaces as a group),
> or when more advanced queuing disciplines are allowed -- then this
> interface will never settle down and never be appropriate as a DDI
> function.
> 
> Separating these two allows you to add new control functions in the
> future without having to modify every mac_open() caller.
> 
> It's as though every fcntl(2) feature needed to be supplied in
> open(2).
> 
> Why is the resource allocation itself an important thing to optimize
> versus the interface stability and scalability?

I don't agree with the "core function" vs. "periphery" argument. 
Resource control is becoming an integral part of the MAC layer, and 
there shouldn't be a need to take extra steps to enable that functionality.

But I agree with your point about designing an API which allows more 
options to be added in the future without breaking backward 
compatibility. However, I think this can be made to work without 
requiring a separate call. I'll need to take a closer look at this.
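
One way this could be made to work, sketched with placeholder names 
rather than a committed interface: bundle the resource parameters into a 
single structure with a mask saying which fields are valid, so new 
controls can be added behind new mask bits without changing the 
signature existing callers see, and callers that don't care simply pass 
NULL.

#include <stdint.h>

#define MRP_MAXBW       0x1
#define MRP_PRIORITY    0x2

typedef struct mac_res_props {
        uint32_t        mrp_mask;       /* which fields below are set */
        uint64_t        mrp_maxbw;      /* bandwidth limit, bits/sec */
        uint32_t        mrp_priority;   /* relative priority */
} mac_res_props_t;

/* Hypothetical open call; NULL mrp means no resource controls. */
int mac_open_sketch(const char *link_name, void **handlep,
    const mac_res_props_t *mrp);

/* Example caller: cap the new client at 100 Mbps, default priority. */
static int
open_capped(void **handlep)
{
        mac_res_props_t mrp = { 0 };

        mrp.mrp_mask = MRP_MAXBW;
        mrp.mrp_maxbw = 100000000ULL;   /* 100 Mbps */
        return (mac_open_sketch("vnic0", handlep, &mrp));
}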

>>>   - MAC_UNICAST_AUTO seems unnecessary to me.  Why not just call first
>>>     with MAC_UNICAST_FACTORY and, if that fails, call again with
>>>     MAC_UNICAST_RANDOM?  Doing that would even have better
>>>     functionality as MAC_UNICAST_AUTO seems to omit the possibility of
>>>     desiring a particular factory address when available.
>> The intent was for AUTO to also allow the slot to be specified; the 
>> slot number would be passed via addr_slot.
> 
> The document says it must be -1.

Yes, and I need to fix the document to allow a slot number to be passed 
when that MAC address type is specified.

>>>     I think having MAC_UNICAST_AUTO in the mix ends up pushing some of
>>>     the control-path complexity out of the user space and into the
>>>     kernel.  It'd be better to simplify the kernel parts.
>> This is very simple logic we're talking about here; I don't see the 
>> problem doing that selection in kernel space. In addition, it avoids 
>> having two system calls per VNIC created on top of NICs which do not 
>> provide multiple factory MAC addresses.
> 
> It's also duplicate logic.  Why optimize for system call counts versus
> kernel code complexity?

There's additional code in the kernel, but that logic is very simple.
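
For what it's worth, the selection amounts to something like the 
following; the helper names here are placeholders for whatever lookups 
the MAC layer actually provides, not real functions:

/* Placeholders for the lookups the MAC layer would provide. */
extern int factory_addr_alloc(int slot, unsigned char *addr);
extern int random_addr_alloc(unsigned char *addr);

/*
 * MAC_UNICAST_AUTO-style selection: prefer an unused factory address
 * (honoring the requested slot if one was given), otherwise fall back
 * to a locally generated random address.
 */
static int
auto_addr_select(int requested_slot, unsigned char *addr)
{
        if (factory_addr_alloc(requested_slot, addr) == 0)
                return (0);
        return (random_addr_alloc(addr));
}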

>>>   - What sorts of privileges are required to create and administer
>>>     VNICs?  Are these things that can be delegated to non-global
>>>     zones?
>> Basically the same as are needed for administering other data-links, 
>> i.e. sys_net_config and net_rawaccess. In a zones environment data-link 
>> administration is limited to the global zone.
> 
> That latter part might not be right for IP Instances, particularly
> since VNICs can be built atop other VNICs.  (Maybe that's just an
> issue for the future, though.)

Even with IP instances, data-link control remains in the global zone.

>>>   - Why is [V]NIC the right level of bandwidth control?  If I want to
>>>     give a zone 100Mbps worth of bandwidth, but I'm giving it multiple
>>>     VNICs, how do I do that -- can the bandwidth control logic do
>>>     accounting based on multiple interfaces (aggregate control, rather
>>>     than individual interface control)?
>> No, the bandwidth control is on a per-interface or per-flow basis. 
>> This is because the bandwidth is basically controlled by polling on a 
>> per ring (software or hardware) basis, not across a set of rings.
> 
> That's quite different from what most QoS implementations I've seen
> do.  The usual model is to map interfaces and flows into a "QoS
> group," which is then controlled as a single unit, as in Cisco's
> "qos-group" feature and policy maps.
> 
> I'd suggest making sure that potential customers of this new bandwidth
> control feature are keenly aware of the no-resource-aggregation
> limitation.  It sounds like it's intended as a fundamental design
> feature, and not something that might be a temporary feature
> limitation that could be removed later.  (As a user, I wouldn't be
> surprised to find that the controls at initial release don't match
> what I actually need, but I'd be very surprised if the controls
> couldn't be fixed later.)

Yes, this will of course be fully documented. If we find an efficient 
way to do bandwidth control across multiple rings in the future, I don't 
see why we wouldn't be able to make use of that functionality.

Thanks,
Nicolas.

-- 
Nicolas Droux - Solaris Networking - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux

