I've updated this document to version 1.1 according to the design review feedback I've received:

  http://www.opensolaris.org/os/project/rbridges/bridging-design.pdf

Unfortunately, if you're on the "To:" list for this message, you now have some extra work to do. The C-team requirements say that I must get an *explicit* ok from each reviewer that the changes satisfy the concerns raised. If you're on the "To:" list of this message, then you are one of those reviewers. Please let me know -- one way or the other -- whether you agree with these changes.

(Sorry to put you through that extra work after you volunteered your time to help out with the review in the first place, but the process demands it.)

If you want to see the old document, that's still available as bridging-design-1.0.pdf. The changes since then are:

  - Removal of the unneeded /dev/bridge/ mechanism (we now use /dev/net/).

  - New self-protection features to avoid table explosion and misbehaving links.

  - Clearer documentation of how FCS (CRC) is handled by bridges.

  - Notes about the optional package installation issues, Dynamic Reconfiguration, InfiniBand, Zones, and MTU change.

  - Addition of the obscure "mcheck" RSTP feature; needed mostly for testing.

  - Discussion of VNICs over the observability node (not currently supported).

  - Updates to disable Crossbow polling on a link when the link is placed in a bridge (due to discussion with the Crossbow team).

Some of these changes will require ARC updates, so I'll be filing a fast-track to cover those things. I'll hold off until I hear from the indicated reviewers, so that I can be sure I've gotten all the changes.

Noted below is a summary of all the comments received, as well as information about how each one was handled.

  ND  Nicolas Droux <[email protected]>
  DR  Darren Reed <[email protected]>
  CZ  Cathy Zhou <[email protected]>
  SR  Sebastien Roy <[email protected]>
  EB  Evtim Batchev <[email protected]>
  DB  David Brean <[email protected]>

DR-1  Reading 1.4, how do I get to the 2nd paragraph from the 1st? What does the bridge modify?
Even if it is talked about elsewhere in the document, some sort of reference (even if oblique) to what is changed/why would be useful to give this some extra context.

REPLY: ACCEPT

The first paragraph deals with ordinary bridges -- ones without VLAN support, which never need to modify packets in flight. Since they never modify frames, they can just preserve FCS from input to output without recalculating.

The second one deals with bridges with 802.1q (VLAN) support, which *do* modify the frame. Modifications to the frame change the FCS. The 802.1q tag is sometimes inserted or removed, depending on the port configuration.

I can add references for the reasons things are modified.

DR-2  Reading 2.4 (Observability Node), it seems like the architecture is becoming rather obscure and complicated to achieve a somewhat simple goal. What other design choices did you consider for achieving this? Are there any better ways we might do it if only....?

REPLY: ACCEPT

After clarifying the issue, this ends up being essentially the same as CZ-3, below. The answer is that /dev/bridge/ will be removed; /dev/net/ has slightly different semantics, but in this case is sufficient.

CZ-1  How does bridging work with the current VNICs in Crossbow? It is discussed in 2.3.2 but I am still not sure. If a packet that belongs to a VNIC over a link arrives, the current Crossbow VNIC implementation will do the internal switching (possibly making its own copies if it is a multicast packet) and pass the packet to the VNICs. If a bridge is configured on the same link, will the bridge do the same thing?

REPLY: EXPLAIN

Not exactly. If it helps, you can think of bridging as taking place at effectively the same point in the stack as link aggregation, at least as far as VNICs are concerned. Crossbow's VNIC/flow identification must occur after the actual receiving NIC is identified (because its tables are per-NIC), and that isn't known until bridge forwarding is complete.

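To make the ordering concrete, here is a rough Python model (not the kernel code, and not how the design document expresses it). It resolves the destination link from a bridge-wide forwarding table first, and only then consults that link's own VNIC table. The names Link, Bridge, and deliver, the table layouts, and the MAC addresses are all hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Link:
    """Hypothetical stand-in for a NIC; its VNIC table is per-link."""
    name: str
    vnics: Dict[str, str] = field(default_factory=dict)  # MAC -> VNIC name


@dataclass
class Bridge:
    """Hypothetical bridge with one forwarding table shared by its links."""
    fwd: Dict[str, Link] = field(default_factory=dict)   # MAC -> link

    def deliver(self, rx_link: Link, dst_mac: str) -> Tuple[str, Optional[str]]:
        # Step 1: bridge forwarding decides which link owns dst_mac.
        # Unknown destinations stay on the receiving link in this model
        # (a real bridge would also flood them to its other links).
        link = self.fwd.get(dst_mac, rx_link)
        # Step 2: only now can that link's per-link VNIC table be consulted.
        return link.name, link.vnics.get(dst_mac)


net0 = Link("net0")
net1 = Link("net1", vnics={"02:08:20:00:00:01": "vnic0"})
bridge = Bridge(fwd={"02:08:20:00:00:01": net1})

# A frame arrives on net0, but the bridge table says the address is on net1,
# so net1's VNIC table is the one that gets searched.
known = bridge.deliver(net0, "02:08:20:00:00:01")
unknown = bridge.deliver(net0, "02:08:20:00:00:99")
```

The point of the model is only the order of the two lookups: a VNIC lookup keyed on the link where the frame physically arrived (net0 here) would miss vnic0 entirely.
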
It would in theory be possible for the bridge code to determine not just the proper link for a received packet, but also the specific VNIC based on MAC destination address. There are several reasons (given in the design document) why we don't do that today. It's a plausible change in the future, though.

[A different direction to go with this would be to have the VNIC code provide a way for lower-level entities to 'hint' at the right VNIC for reception. That way, a hardware device that can tag a packet based on distinct destination MAC address could take advantage of this bit of acceleration, as could bridging.]

CZ-2  What if a link and a VNIC/VLAN over that link are put into a different zone?

REPLY: EXPLAIN

That's the same as things are today; no change. Crossbow handles the VNIC and/or VLAN identification _after_ bridging is done, so you can go ahead and put your VNICs and/or VLANs in different zones if you want. Bridging has nothing to do with that.

CZ-3  It sounds like the bridge will be another class of data-links. So why are their observability nodes not in /dev/net? Are you proposing a different option (not "-d") for snoop to snoop a bridge instance? Can the namespace of other data-links and the bridges (plumb the trailing '0') overlap?

REPLY: ACCEPT

After some discussion, the /dev/bridge/ directory is gone (as is the supporting extension to devfs), and we're using just the regular /dev/net/ entries for the observability node.

The only difference is that the former /dev/bridge/ entries would be removed on bridge shutdown, and the current /dev/net/ entries are removed only when the last using stream is shut down as well. The difference is in a now-unimportant corner case, and the reduction of complexity is substantial, so we're making the change.

CZ-4  Will we allow creating VNICs over a bridge? I guess the answer is no, since I don't understand what that would mean.

REPLY: EXPLAIN

It would likely not make sense to create VNICs on the bridge observability node to use with 'ifconfig', as you can't plumb anything on this node, but it might (in theory at least) be useful as a way to place bridge observability nodes inside non-global zones.

Currently, however, we do not plan to support it. The VNIC creation code uses mac_unicast_add(), and that function fails (intentionally) on the observability node because transmit isn't supported. In addition to that, use of bridges in non-global zones won't be supported, so there's little reason to do this.

(Obviously, you can create VNICs using regular macs, regardless of whether those macs are assigned to a bridge.)

CZ-5  You discussed the link state of a bridge in 2.5. Who will be the consumer of this state?

REPLY: EXPLAIN

Any snoop-like application using the bridge observability node could use this to report bridge up/down status, if it had the capability of using DL_NOTIFY_*. (Such a capability in the application would be useful on ordinary links as well.)

The link state information is there mostly because it was convenient and logical to make the aggregate status available like this. (Plus, it makes testing MTU changes much easier.)

CZ-6  Is DR being considered? I mean, if a bridge is configured over a device, can that device then be DRed out?

REPLY: ACCEPT

We will update the SUNW_network_rcm plug-in to remove links from bridges when the links are removed from the system.

CZ-7  One question that comes to mind is whether the bridge starts to take effect as soon as we create it. I ask because in theory other types of data-links (VNICs, aggregations) only take effect and start the underlying MAC once we plumb them, and that makes things like changing the MTU of the underlying device much simpler. So I am wondering whether there is an equivalent of "plumbing" a bridge.

REPLY: ACCEPT

Not exactly, but you can stop and start the "bridge:<name>" SMF service, and that has the effect you're after.

When the SMF instance is disabled, the daemon stops running, the bridge instance is disconnected from all of its links, and is freed. Temporarily stopping the bridge instance is how I play around with MTU, and I'll add that to the design document. I recommend setting MTUs coherently first, though, before configuring the bridge.

CZ-8  Can the namespace of other data-links and the bridges (plumb the trailing '0') overlap?

REPLY: EXPLAIN

No -- the bridge observability node is established as a link name, so you can't have another link with the same name.

CZ-9  I know Seb is going to add the dladm administration support in the non-global zones. I suppose after that, we will be able to create a bridge in the non-global zones too? Of course, we can only create bridges between data-links that belong to that specific zone.

REPLY: DEFER

If at some point, you can create aggregations inside non-global zones, then it may make sense to talk about being able to configure bridges inside non-global zones. Today, neither can be done. The object that is assigned to a non-global zone isn't the underlying MAC used for bridging and aggregation (it's just a VLAN), and the privileges necessary to manipulate this configuration (sys_net_config) are excluded from non-global zones.

[Much discussion on the list omitted; Seb's work is for tunnels, which are unrelated to this project, and there are substantial issues to resolve in general non-global zone administration of macs, even if it's a useful feature. Aggregations are out of scope, so bridging is as well.]

I'll add a note to the futures section.

CZ-10  How do you tell whether a MAC is a real MAC or a virtualized MAC? Say an xnf data-link is in DomU, how do you tell that it does not correspond to a real physical device?

REPLY: EXPLAIN

You should be able to create bridges inside xVM DomU and similar environments. Zones are different, though, because not all control is placed within the zone, even when marked as "exclusive IP stack."

Currently, we can tell the type of device by looking at DATALINK_CLASS_* from dladm_name2info(). If it's something other than DATALINK_CLASS_PHYS, DATALINK_CLASS_AGGR, or DATALINK_CLASS_ETHERSTUB, then it's something we cannot use.

EB-1  Do we plan for any "protection" of the bridge forwarding table? A MAC flood, with random or carefully chosen MAC addresses, may easily render the bridge useless or misbehaving. Memories of good ol' switches with CDP table space exhausted come to my mind.

REPLY: ACCEPT

A settable learning rate limit and table size limit will be added to the design.

EB-2  Some things like a sanity check on moving one node from one port of the bridge to the other, limiting the rate of "newly discovered" nodes for inclusion in the forwarding table (probably might be limited at the Crossbow level), etc etc.

[W]hile there are a lot of legitimate reasons for one MAC address to move sides of a bridge, this will not happen _that_ often. So what I wanted to ask (rather than say here) is whether there is a way to guarantee (at least to a reasonable degree) that a move is legit... Like the bridge should see "n" (n less than give up on retransmit) targeted communications from a node without any sign of life from the other side in order to move the side. And that is only for active nodes ... maybe

REPLY: DEFER

That sounds like a future project to me. At least I don't think I understand the scope of what you're talking about.

If we don't switch "immediately," then all sorts of things won't work right. For instance, a link failure or recovery within the network can easily cause all of the nodes that once appeared to be on one link to appear to be on another link. Failing to update those learned entries on such a switch means that we're actively compounding the problem by making recovery from the situation take longer. And we're doing it without actually improving much about security, as an attacker could easily just send messages periodically in an attempt to mess things up.

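As a rough illustration of how the EB-1 limits and the "switch immediately" behavior fit together, here is a small Python model. It is only a sketch under my own assumptions: the class name, the one-second accounting window, and the rule that moves of already-known addresses are never refused are hypothetical, and the design document may define the limits differently. What the bridge does with frames from addresses it declined to learn is not modeled.

```python
from typing import Dict


class LearningTable:
    """Hypothetical model of a forwarding table with self-protection."""

    def __init__(self, max_entries: int, max_new_per_sec: int) -> None:
        self.max_entries = max_entries          # table size limit
        self.max_new_per_sec = max_new_per_sec  # learning rate limit
        self.entries: Dict[str, int] = {}       # MAC address -> port number
        self.window_start = 0.0                 # start of current 1 s window
        self.new_in_window = 0                  # new entries in this window

    def learn(self, mac: str, port: int, now: float) -> bool:
        """Record that `mac` was seen on `port`; return False if refused."""
        if mac in self.entries:
            # A known address seen on another port moves at once; delaying
            # the move would slow recovery after a topology change.
            self.entries[mac] = port
            return True
        if now - self.window_start >= 1.0:
            # Begin a new one-second accounting window.
            self.window_start = now
            self.new_in_window = 0
        if (len(self.entries) >= self.max_entries
                or self.new_in_window >= self.max_new_per_sec):
            return False  # a limit was hit: do not learn this address
        self.entries[mac] = port
        self.new_in_window += 1
        return True


table = LearningTable(max_entries=2, max_new_per_sec=10)
table.learn("00:00:00:00:00:0a", 1, now=0.0)
table.learn("00:00:00:00:00:0a", 2, now=0.1)   # moves immediately to port 2
table.learn("00:00:00:00:00:0b", 1, now=0.2)
refused = not table.learn("00:00:00:00:00:0c", 1, now=0.3)  # table is full

slow = LearningTable(max_entries=10, max_new_per_sec=1)
first = slow.learn("00:00:00:00:00:01", 1, now=0.0)
second = slow.learn("00:00:00:00:00:02", 1, now=0.5)  # over the rate limit
third = slow.learn("00:00:00:00:00:02", 1, now=1.0)   # new window, accepted
```

In this model the limits only bound how many new addresses are learned; they never delay an update to an address that is already in the table.
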
And (of course) if you're on a network with a repeater/hub, then you have no such protection at all, so this is really an existing unsolved problem.

As an alternative, I suspect that it would be possible to enter a "hold-down" mode when you see an entry switch back and forth multiple times in a short period of time. I think that'd require some more research, though, to get right.

I'll add notes about this to the section on futures.

EB-3  I am not thinking only of the "amusement" side effect, but when combined with VLANs it may bring "practical" effects.

REPLY: EXPLAIN

VLANs are addressed in the document. The design document describes independent VLAN learning, so one VLAN doesn't affect how another operates.

DB-1  Can this be used to bridge IPoIB or EoIB and Eth? On the IB side, the connectivity could easily behave like a hub, but might be enhanced to behave like a switch.

REPLY: DEFER

As long as it looks like an Ethernet mac in GLDv3, I'd assume yes. I don't have access to any test hardware to try it out, though, and it doesn't seem that OpenSolaris currently supports EoIB. I'll test when possible.

(The expectation would be that IPoIB won't work, as that's clearly not Ethernet, and only Ethernet can be bridged via the 802 mechanisms we're using. EoIB should work, and anything else will be treated as a bug.)

ND-1  We already talked about having more common MAC hooks that would allow consumers like bridging and L2 filtering to intercept and inject packets without having to change the common data-path. I'll start defining those soon.

REPLY: OK

ND-2  For the RX path you need to handle the poll path as well or you will miss packets, see mac_rx_srs_poll_ring(). I didn't see that covered in your design doc or code. mac_rx() covers the interrupt path only.

REPLY: ACCEPT

After much discussion (and help from Venu), we've settled on creating an internal mac_poll_state_change() function that sets/clears MIS_POLL_DISABLE in mi_state_flags, and then calls mac_client_update_classifier() for each of the client streams.

The new function will be called from mac_bridge_set() and mac_bridge_clear(). This will disable poll mode when bridging is enabled.

ND-3  BTW this is another example why generic MAC hooks would be useful to have, i.e. you wouldn't have to deal with all the details of the data-path.

REPLY: OK

ND-4  For the TX path we're going to modify the TX default processing to use a fanout of rings instead of a single default TX ring to better scale. So some of the changes you are making related to the TX entry points would have to be redone. It would be better to not have the callers specify the default TX ring.

REPLY: OK

Agreed; I think the simpler way would be to have a single mac_tx function that knows how to deal with rings when necessary.

SR-1  [Regarding discussion on CZ-3.] There seems to be some confusion. /dev/net contains full-fledged DLPI nodes for datalinks. /dev/ipnet contains observability DLPI nodes for IP interfaces. I think Cathy mentioned /dev/net, but your answer seems as if you thought she mentioned /dev/ipnet...

REPLY: ACCEPT

Oops. You're quite right. The /dev/net/ name confuses me because we use "dl" everywhere else. :-/

We're using /dev/net/ for observability and ioctl access to the bridging feature in the kernel.

--
James Carlson, Solaris Networking              <[email protected]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
_______________________________________________
networking-discuss mailing list
[email protected]
