[crossbow-discuss] Updated Crossbow virtualization architecture document

Nicolas Droux Tue, 28 Aug 2007 16:47:17 -0600

Jim,

Thanks for the comments.

James Carlson wrote:
> Nicolas Droux writes:
>> http://opensolaris.org/os/project/crossbow/Docs/crossbow-virt.pdf
> 
> I have a few questions about this.  I've also read through as much of
> the crossbow-discuss archives as seemed to be related to these
> topics, and didn't find answers there.
> 
>   - Why are bandwidth, CPU control, and MAC address assignment
>     exclusively a VNIC feature, at least at the administrative level?
>     Section 4.7 seems to say that MAC instances will get these
>     features, so shouldn't this be "modify-dev" instead?

Bandwidth control, CPU mapping, and fanout are not exclusive to VNICs.
They will be expressed as properties, and will apply to non-VNIC
data-links as well. This will be described in detail in another upcoming
document. I'll see what I can do to make that clearer in the
virtualization document I sent out for review.

From the administration interface point of view, there are two ways to
associate properties with data-links. For data-links that are created
through a dladm subcommand like create-vnic, the initial set of
properties can be specified during the creation of the data-link itself
through a dedicated option. In addition, the properties can be set on
any data-link through the set-linkprop subcommand. The former allows the
administrator to create a VNIC with bandwidth control in a single
command instead of having to go through a two-step dance.

> 
>     Needing to create a "dummy" VNIC on top of a regular interface
>     just to interpose these new features seems like an implementation
>     artifact.

No, that won't be needed, see above.

> 
>   - I assume we need a redesign of the VLAN code in order to get
>     per-VLAN bandwidth control.  Is that redesign part of Crossbow, or
>     is it some later project?  In reading the archives, it seems that
>     it's been proposed as part of Crossbow, but in reading this
>     document it seems to be part of something else.

Yes, we're currently planning to move VLAN processing down to the MAC 
layer itself, and the VLAN processing currently in the DLS layer will be 
removed. This still needs to be properly documented.

> 
>   - If per-VLAN control appears, do the units of administration
>     change?  Does it then become reasonable to talk about bandwidth
>     and CPU control using "set-linkprop"?

Yes, the properties will apply to VLAN data-links as well, see above.

> 
>   - Do bandwidth and CPU controls rely on squeues?  If so, then VNICs
>     may not be able to control utilization from non-IP traffic, such
>     as with bridging.

There is a level of bandwidth control done by the squeue, but there's
also bandwidth control done by the MAC layer itself. The latter is
useful when bandwidth control is needed before fanout to multiple CPUs
at the MAC layer, for non-IP protocols, or when the MAC is being used by
a virtual machine's back-end drivers in the host OS. See also
Sunay's writeup at
http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
for more details on this topic.

>   - I'm not sure I understand the (undocumented? -- not in summary)
>     "-F" option for move-vnic.  If I'm using a factory address on one
>     NIC and I move a VNIC to another NIC, does this cause the VNIC to
>     continue using the _same_ address but just on a new NIC?
> 
>     If so, how is duplication avoided if that factory address is ever
>     reused from the original NIC?
> 
>     I would have expected that a VNIC using a factory address would
>     just get a *new* address during a forced move to a new NIC.
>     Changing MAC address during reconfiguration doesn't seem like a
>     disaster to me -- in fact, it seems expected.  Why should it try
>     to retain the address?

I was trying to allow the system administrator to minimize the impact on
the existing MAC address assignment when a VNIC is moved off a device
and then back to it. But I agree that it's not optimal. If the folks on
this list feel that the MAC address changing is not an issue, I've no
problem using the simpler scheme of assigning a new MAC address to the
VNIC/MAC client.

>   - For showing statistics with "show-vnic -s", are these the same as
>     "show-link -s"?  If so, wouldn't the existing "show-link -s" do
>     the job?

Agreed, show-link -s should do fine here.

>   - What do "up" and "down" mean?  Are these equivalent to controlling
>     the "RUNNING" bit from user space (i.e., some way of marking link
>     up and link down manually)?  Or are they something else?  Should
>     regular MAC instances (other than VNICs) have the ability to be
>     set administratively up and down?
> 
>     What would happen if VNICs were always "up?"

Here it means causing the VNIC MACs to register with the framework. The
same functionality already exists for link aggregations. Meem suggested
init-vnic instead, which would be fine with me and would avoid potential
confusion with ifconfig up. I still need to update that part of the
document.

> 
>   - What happens if a NIC is oversubscribed by the amount of bandwidth
>     configured for the VNICs?  Is the result proportionate (and thus
>     "fair") allocation, or do they compete on some other grounds?

In that case it will depend on other factors such as the type of 
traffic, the CPU(s) processing that traffic, etc.

> 
>     What kind of bandwidth control exists here?  How granular is it,
>     and what effects do clients see from restricted bandwidth?  Are
>     packets dropped (they have to be, if bandwidth limits apply to
>     forwarded traffic)?  If so, is it tail drop or something more
>     sophisticated?

In general, if an SRS or flow is assigned its own hardware ring, then the
polling thread will poll packets directly from the ring, and there's no 
dropping from the host. Packets will be polled from the rings when 
allowed as per bandwidth limits and consumption. The polling thread is 
scheduled every tick, and we compute a maximum number of bytes per tick.
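
As a rough worked example of the per-tick budget: the sketch below
converts a configured limit into a byte count per tick. The function
name and the 100 ticks per second default are assumptions made for
illustration; they are not taken from the Crossbow code.

```python
def max_bytes_per_tick(limit_bits_per_sec: int, ticks_per_sec: int = 100) -> int:
    """Byte budget a polling thread may pull from its ring in one tick.

    ticks_per_sec=100 is an assumed tick rate, used only for illustration.
    """
    if limit_bits_per_sec < 0 or ticks_per_sec <= 0:
        raise ValueError("limit must be >= 0 and tick rate must be > 0")
    bytes_per_sec = limit_bits_per_sec // 8
    return bytes_per_sec // ticks_per_sec


# A 100 Mbps limit at the assumed 100 ticks per second:
print(max_bytes_per_tick(100_000_000))  # 125000
```

With these numbers, once 125000 bytes have been polled in a tick, the
remaining packets simply stay in the hardware ring until the next tick.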

If more than one SRS/squeue share a ring, there's no polling of the 
ring. Instead, traffic will be interrupt driven, and packets will be 
deposited on queues associated with the SRS/squeue. Packets are then 
pulled from these queues based on bandwidth limits. If the maximum 
number of packets in these queues is exceeded, then there's tail drop. 
Again, see the SRS design doc.
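
The shared-ring case above can be sketched as a bounded queue with tail
drop on the interrupt side and a byte-budgeted drain on the other side.
This is a toy model only: the class, its method names, and the queue
depth are hypothetical and do not reflect the actual SRS data
structures.

```python
from collections import deque


class BandwidthLimitedQueue:
    """Bounded FIFO: tail drop on enqueue, byte-budgeted drain per tick."""

    def __init__(self, max_packets: int, bytes_per_tick: int) -> None:
        self.max_packets = max_packets
        self.bytes_per_tick = bytes_per_tick
        self.packets = deque()  # packet sizes in bytes, oldest first
        self.dropped = 0

    def enqueue(self, size: int) -> bool:
        """Interrupt path: queue the packet, or drop it if the queue is full."""
        if len(self.packets) >= self.max_packets:
            self.dropped += 1  # tail drop: the arriving packet is discarded
            return False
        self.packets.append(size)
        return True

    def drain_tick(self) -> list:
        """Pull packets in arrival order until this tick's budget is spent.

        Simplification: a packet larger than the whole per-tick budget
        would never drain here; a real implementation must handle that.
        """
        budget = self.bytes_per_tick
        sent = []
        while self.packets and self.packets[0] <= budget:
            size = self.packets.popleft()
            budget -= size
            sent.append(size)
        return sent


q = BandwidthLimitedQueue(max_packets=3, bytes_per_tick=3000)
accepted = [q.enqueue(1500) for _ in range(4)]  # fourth arrival is dropped
sent = q.drain_tick()                           # two 1500-byte packets fit
```

The point of the sketch is that the sender only sees loss once the
queue depth is exceeded; below that, the limit shows up as delay.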

>   - Can a VNIC be built atop another non-anchor VNIC?  (Seems like the
>     answer is "yes.")

Correct.

> 
>   - When VNICs share rings due to a lack of hardware resources, what
>     happens when the client of one VNIC is using polling and the
>     client of the other one is not?
>     Won't one client end up blanking the interrupts for another?

If there's one ring shared by multiple VNICs, traffic arrival will be 
interrupt based, and after software classification, traffic will be 
deposited to software rings.

If there are multiple hardware rings but only one interrupt, then the 
driver does not disable the hardware interrupt. Instead, it takes note 
of the request from the stack to not interrupt for specific rings. When 
a hardware interrupt is received, it avoids consuming packets from these 
rings, and continues delivering traffic to the MAC layer otherwise. 
Again, see the document on SRS and bandwidth control for more details.
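
A minimal sketch of the shared-interrupt behavior described above,
assuming a driver that keeps a per-ring "do not interrupt" note. All
names here are hypothetical; this models the idea, not any real
driver or MAC layer interface.

```python
class SharedInterruptDriver:
    """Several receive rings behind a single hardware interrupt."""

    def __init__(self, num_rings: int) -> None:
        self.rings = [[] for _ in range(num_rings)]
        # Rings for which the stack asked not to be interrupted.
        self.polled = set()

    def ring_intr_disable(self, ring: int) -> None:
        """Note the request; the hardware interrupt itself stays enabled."""
        self.polled.add(ring)

    def ring_intr_enable(self, ring: int) -> None:
        self.polled.discard(ring)

    def receive(self, ring: int, packet: str) -> None:
        """Hardware places an arriving packet on a ring."""
        self.rings[ring].append(packet)

    def interrupt(self) -> list:
        """Deliver from rings not being polled; leave polled rings alone."""
        delivered = []
        for index, ring in enumerate(self.rings):
            if index in self.polled:
                continue
            delivered.extend(ring)
            ring.clear()
        return delivered

    def poll(self, ring: int) -> list:
        """The stack's polling thread pulls packets from its own ring."""
        packets = list(self.rings[ring])
        self.rings[ring].clear()
        return packets


drv = SharedInterruptDriver(num_rings=2)
drv.ring_intr_disable(1)          # client of ring 1 is polling
drv.receive(0, "a")
drv.receive(1, "b")
from_interrupt = drv.interrupt()  # only ring 0 traffic is delivered
from_poll = drv.poll(1)           # ring 1 traffic waits for its poller
```

In this model the polling client never blanks interrupts for the other
ring: the interrupt still fires, and only the polled ring is skipped.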

>   - Instead of adding more arguments to mac_open() to handle priority
>     and bandwidth, I'd suggest making these separate calls.  You'll
>     need the separate call anyway to implement the "modify" mechanism.

Having the parameters specified in mac_open() is useful since it allows
these parameters to be specified when the resources are allocated to
the MAC client. This avoids allocating a set of default resources and
then immediately changing these resources through a separate modify
mechanism. If we can specify them through 2-3 arguments I don't think
this should be an issue.

>   - What exactly does exclusive MAC access do?  If mac_exclusive_set
>     is called, are other client requests blocked (sleeping)?  Or are
>     they rejected (return error)?  Or are they just let through, and
>     all clients are expected to bracket requests with exclusive
>     set/clear calls?

This is basically the equivalent of the 
mac_active_set()/mac_active_clear() we have in Nevada today. I'm looking 
into whether the same semantics could be implemented indirectly through 
the mac_unicst_set() with the primary MAC address, since there's only 
one and it can be assigned only to one MAC client.

> 
>   - MAC_UNICAST_AUTO seems unnecessary to me.  Why not just call first
>     with MAC_UNICAST_FACTORY and, if that fails, call again with
>     MAC_UNICAST_RANDOM?  Doing that would even have better
>     functionality as MAC_UNICAST_AUTO seems to omit the possibility of
>     desiring a particular factory address when available.

The intent was for AUTO to allow the slot to be specified: that option
should accept the slot number via addr_slot.

>     I think having MAC_UNICAST_AUTO in the mix ends up pushing some of
>     the control-path complexity out of the user space and into the
>     kernel.  It'd be better to simplify the kernel parts.

This is very simple logic we're talking about here; I don't see the
problem with doing that selection in kernel space. In addition, it avoids
having two system calls per VNIC created on top of NICs which do not
provide multiple factory MAC addresses.
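
To illustrate how small the selection logic is, here is a sketch of an
"auto" policy: use a free factory address if one exists, otherwise
fall back to a random address, all in a single call. The function
names and data layout are hypothetical, and selecting a specific slot
(addr_slot) is left out. The only non-assumed detail is that a random
unicast address should have the locally administered bit set and the
multicast bit clear.

```python
import random


def random_mac(rng=random) -> str:
    """Random unicast MAC: locally administered bit set, multicast bit clear."""
    first = (rng.randrange(256) | 0x02) & 0xFE
    rest = [rng.randrange(256) for _ in range(5)]
    return ":".join(f"{octet:02x}" for octet in [first] + rest)


def select_unicast_auto(factory_addrs, in_use, rng=random):
    """Return ("factory", addr) for the first free slot, else ("random", addr).

    factory_addrs: list of factory addresses, indexed by slot.
    in_use: set of slot numbers already assigned; updated in place.
    """
    for slot, addr in enumerate(factory_addrs):
        if slot not in in_use:
            in_use.add(slot)
            return ("factory", addr)
    return ("random", random_mac(rng))


# One factory address available (address taken from the RFC 7042
# documentation range), then a second request on the same NIC.
used = set()
first = select_unicast_auto(["00:00:5e:00:53:01"], used)
second = select_unicast_auto(["00:00:5e:00:53:01"], used)
```

Here the first request gets the factory address and the second falls
back to a random one, without a second round trip from user space.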

>   - What sorts of privileges are required to create and administer
>     VNICs?  Are these things that can be delegated to non-global
>     zones?

Basically the same ones that are needed for administering other
data-links, i.e. sys_net_config and net_rawaccess. In a zones
environment, data-link administration is limited to the global zone.

>   - Why is [V]NIC the right level of bandwidth control?  If I want to
>     give a zone 100Mbps worth of bandwidth, but I'm giving it multiple
>     VNICs, how do I do that -- can the bandwidth control logic do
>     accounting based on multiple interfaces (aggregate control, rather
>     than individual interface control)?

No, the bandwidth control is on a per-interface or per-flow basis.
This is because the bandwidth is basically controlled by polling on a
per-ring (software or hardware) basis, not across a set of rings.

>     If I have application-level controls, such as HTTP virtual servers
>     or a sendmail configuration handling multiple domains, how can I
>     control bandwidth for those things?  Won't the application need to
>     be involved?

For that you would use flowadm(1M), which we are also introducing as
part of Crossbow and which will be described separately. My document
focuses on the virtualization aspects of the project.

Nicolas.

-- 
Nicolas Droux - Solaris Networking - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux
