[networking-discuss] bridging design review follow-up

James Carlson Thu, 28 May 2009 13:15:45 -0700

I've updated this document to version 1.1 according to the design
review feedback I've received:


  http://www.opensolaris.org/os/project/rbridges/bridging-design.pdf

Unfortunately, if you're on the "To:" list for this message, you now
have some extra work to do.

The C-team requirements say that I must get an *explicit* ok from each
reviewer that the changes satisfy the concerns raised.  If you're on
the "To:" list of this message, then you are one of those reviewers.
Please let me know -- one way or the other -- whether you agree with
these changes.  (Sorry to put you through that extra work after you
volunteered your time to help out with the review in the first place,
but the process demands it.)

If you want to see the old document, that's still available as
bridging-design-1.0.pdf.  The changes since then are:

  - Removal of the unneeded /dev/bridge/ mechanism (we now use
    /dev/net/).

  - New self-protection features to avoid table explosion and
    misbehaving links.

  - Clearer documentation of how FCS (CRC) is handled by bridges.

  - Notes about the optional package installation issues, Dynamic
    Reconfiguration, InfiniBand, Zones, and MTU change.

  - Addition of the obscure "mcheck" RSTP feature; needed mostly for
    testing.

  - Discussion of VNICs over the observability node (not currently
    supported).

  - Updates to disable Crossbow polling on a link when the link is
    placed in a bridge (due to discussion with the Crossbow team).

Some of these changes will require ARC updates, so I'll be filing a
fast-track to cover those things.  I'll hold off until I hear from the
indicated reviewers, so that I can be sure I've gotten all the
changes.

Noted below is a summary of all the comments received, as well as
information about how each one was handled.

ND      Nicolas Droux
DR      Darren Reed
CZ      Cathy Zhou
SR      Sebastien Roy
EB      Evtim Batchev
DB      David Brean

DR-1    Reading 1.4, how do I get to the 2nd paragraph from the 1st?
        What does the bridge modify?
        Even if it is talked about elsewhere in the document, some
        sort of reference (even if oblique) to what is changed/why
        would be useful to give this some extra context.

REPLY:  ACCEPT

        The first paragraph deals with ordinary bridges -- ones
        without VLAN support, which never need to modify packets in
        flight.  Since they never modify frames, they can just
        preserve FCS from input to output without recalculating.

        The second one deals with bridges with 802.1q (VLAN) support,
        which *do* modify the frame.  Modifications to the frame
        change the FCS.  The 802.1q tag is sometimes inserted or
        removed, depending on the port configuration.

        I can add references for the reasons things are modified.
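
        As a rough illustration of the FCS point (a Python sketch,
        not code from the bridge implementation; the frame contents
        and helper names are invented for the example): inserting or
        removing an 802.1q tag changes the bytes covered by the
        CRC-32, so the FCS has to be recomputed, while an unmodified
        frame can keep the FCS it arrived with.

```python
import struct
import zlib


def fcs(frame: bytes) -> bytes:
    """Ethernet FCS: CRC-32 of the frame, least-significant byte first."""
    return struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def insert_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag (TPID 0x8100 + TCI) after the two MAC addresses."""
    if not 0 <= vid <= 0xFFF or not 0 <= pcp <= 7:
        raise ValueError("VID must fit in 12 bits and PCP in 3 bits")
    tag = struct.pack("!HH", 0x8100, (pcp << 13) | vid)
    return frame[:12] + tag + frame[12:]


def strip_vlan_tag(frame: bytes) -> bytes:
    """Remove the 802.1Q tag from a tagged frame."""
    if frame[12:14] != b"\x81\x00":
        raise ValueError("frame is not 802.1Q tagged")
    return frame[:12] + frame[16:]


# Hypothetical untagged frame (FCS not included): broadcast destination,
# locally administered source, EtherType 0x0800, 46 zero bytes of payload.
untagged = (
    bytes.fromhex("ffffffffffff")
    + bytes.fromhex("020000000001")
    + struct.pack("!H", 0x0800)
    + bytes(46)
)
tagged = insert_vlan_tag(untagged, vid=5)

# The tagged frame is four bytes longer and needs a freshly computed FCS.
print(fcs(untagged).hex(), fcs(tagged).hex())
```

        A bridge without VLAN support never reaches insert_vlan_tag()
        or strip_vlan_tag(), which is why it can pass the received FCS
        through untouched.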

DR-2    Reading 2.4 (Observability Node), it seems like the
        architecture is becoming rather obscure and complicated to
        achieve a somewhat simple goal.
        What other design choices did you consider for achieving this?
        Are there any better ways we might do it if only....?

REPLY:  ACCEPT

        After clarifying the issue, this ends up being essentially the
        same as CZ-3, below.  The answer is that /dev/bridge/ will be
        removed; /dev/net/ has slightly different semantics, but in
        this case is sufficient.

CZ-1    How does bridging work with the current VNICs in Crossbow? It
        is discussed in 2.3.2 but I am still not sure. If a packet
        that belongs to a VNIC over a link arrives, the current
        Crossbow VNIC implementation will do the internal switching
        (possibly making its own copies if it is a multicast packet)
        and pass the packet to the VNICs. If a bridge is configured on
        the same link, will the bridge do the same thing?

REPLY:  EXPLAIN

        Not exactly.  If it helps, you can think of bridging as taking
        place at effectively the same point in the stack as link
        aggregation, at least as far as VNICs are concerned.
        Crossbow's VNIC/flow identification must occur after the
        actual receiving NIC is identified (because its tables are
        per-NIC), and that isn't known until bridge forwarding is
        complete.

        It would in theory be possible for the bridge code to
        determine not just the proper link for a received packet, but
        also the specific VNIC based on MAC destination address.
        There are several reasons (given in the design document) why
        we don't do that today.  It's a plausible change in the
        future, though.

        [A different direction to go with this would be to have the
        VNIC code provide a way for lower-level entities to 'hint' at
        the right VNIC for reception.  That way, a hardware device
        that can tag a packet based on distinct destination MAC
        address could take advantage of this bit of acceleration, as
        could bridging.]

CZ-2    What if a link and a VNIC/VLAN over that link are put into a
        different zone?

REPLY:  EXPLAIN

        That's the same as things are today; no change.  Crossbow
        handles the VNIC and/or VLAN identification _after_ bridging
        is done, so you can go ahead and put your VNICs and/or VLANs
        in different zones if you want.  Bridging has nothing to do
        with that.

CZ-3    It sounds like the bridge will be another class of
        data-links. So why are their observability nodes not in
        /dev/net? Are you proposing a different option (not "-d") for
        snoop to snoop a bridge instance? Can the namespace of other
        data-links and the bridges (plus the trailing '0') overlap?

REPLY:  ACCEPT

        After some discussion, the /dev/bridge/ directory is gone (as
        is the supporting extension to devfs), and we're using just
        the regular /dev/net/ entries for the observability node.

        The only difference is that the former /dev/bridge/ entries
        would be removed on bridge shutdown, and the current /dev/net/
        entries are removed only when the last using stream is shut
        down as well.  The difference is in a now-unimportant corner
        case, and the reduction of complexity is substantial, so we're
        making the change.

CZ-4    Will we allow creating VNICs over a bridge? I guess the answer
        is no, since I don't understand what that would mean.

REPLY:  EXPLAIN

        It would likely not make sense to create VNICs on the bridge
        observability node to use with 'ifconfig', as you can't plumb
        anything on this node, but it might (in theory at least) be
        useful as a way to place bridge observability nodes inside
        non-global zones.

        Currently, however, we do not plan to support it.  The VNIC
        creation code uses mac_unicast_add(), and that function fails
        (intentionally) on the observability node because transmit
        isn't supported.  In addition to that, use of bridges in
        non-global zones won't be supported, so there's little reason
        to do this.

        (Obviously, you can create VNICs using regular macs,
        regardless of whether those macs are assigned to a bridge.)

CZ-5    You discussed the link state of a bridge in 2.5. Who will be
        the consumer of this state?

REPLY:  EXPLAIN

        Any snoop-like application using the bridge observability node
        could use this to report bridge up/down status, if it had the
        capability of using DL_NOTIFY_*.  (Such a capability in the
        application would be useful on ordinary links as well.)

        The link state information is there mostly because it was
        convenient and logical to make the aggregate status available
        like this.  (Plus, it makes testing MTU changes much easier.)

CZ-6    Is DR being considered? I mean, if a bridge is configured over
        a device, can that device then be DRed out?

REPLY:  ACCEPT

        We will update the SUNW_network_rcm plug-in to remove links
        from bridges when the links are removed from the system.

CZ-7    One question that comes to mind is whether the bridge starts
        to take effect as soon as we create it. I ask because in
        theory other types of data-links (VNICs, aggregations) only
        take effect and start the underlying MAC once we plumb them,
        and that makes things like changing the MTU of the
        underlying device much simpler. So I am wondering whether
        there is an equivalent of "plumbing" a bridge.

REPLY:  ACCEPT

        Not exactly, but you can stop and start the "bridge:<name>"
        SMF service, and that has the effect you're after.  When the
        SMF instance is disabled, the daemon stops running, the bridge
        instance is disconnected from all of its links, and is freed.

        Temporarily stopping the bridge instance is how I play around
        with MTU, and I'll add that to the design document.  I
        recommend setting MTUs coherently first, though, before
        configuring the bridge.

CZ-8    Can the namespace of other data-links and the bridges (plus
        the trailing '0') overlap?

REPLY:  EXPLAIN

        No -- the bridge observability node is established as a link
        name, so you can't have another link with the same name.

CZ-9    I know Seb is going to add the dladm administration support in
        the non-global zones. I suppose after that, we will be able to
        create a bridge in the non-global zones too? Of course, we can
        only create bridges between data-links that belong to that
        specific zone.

REPLY:  DEFER

        If at some point, you can create aggregations inside
        non-global zones, then it may make sense to talk about being
        able to configure bridges inside non-global zones.  Today,
        neither can be done.  The object that is assigned to a
        non-global zone isn't the underlying MAC used for bridging and
        aggregation (it's just a VLAN), and the privileges necessary
        to manipulate this configuration (sys_net_config) are excluded
        from non-global zones.

        [Much discussion on the list omitted; Seb's work is for
        tunnels, which are unrelated to this project, and there are
        substantial issues to resolve in general non-global zone
        administration of macs, even if it's a useful feature.
        Aggregations are out of scope, so bridging is as well.]

        I'll add a note to the futures section.

CZ-10   How do you tell whether a MAC is a real MAC or a virtualized
        MAC? Say an xnf data-link is in a DomU; how do you tell that
        it does not correspond to a real physical device?

REPLY:  EXPLAIN

        You should be able to create bridges inside xVM DomU and
        similar environments.  Zones are different, though, because
        not all control is placed within the zone, even when marked as
        "exclusive IP stack."

        Currently, we can tell the type of device by looking at
        DATALINK_CLASS_* from dladm_name2info().  If it's something
        other than DATALINK_CLASS_PHYS, DATALINK_CLASS_AGGR, or
        DATALINK_CLASS_ETHERSTUB, then it's something we cannot use.

EB-1    Do we plan for any "protection" of the bridge forwarding
        table?

        A MAC flood, with random or carefully chosen MAC addresses may
        easily render the bridge useless or misbehaving.

        Reminiscence of good 'ole Switches with CDP table space
        exhausted come to my mind.

REPLY:  ACCEPT

        A settable learning rate limit and table size limit will be
        added to the design.
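
        To make the two limits concrete, here is a minimal Python
        sketch of a learning table with a size cap and a cap on new
        entries per second.  It is only an illustration: the class
        name, parameter names, the one-second window, and the choice
        to always accept updates for already-known addresses are all
        assumptions, not details taken from the design document.

```python
class LearningTable:
    """Toy forwarding table with a size limit and a learning rate limit."""

    def __init__(self, max_entries: int, max_new_per_sec: int):
        self.max_entries = max_entries          # hypothetical tunable
        self.max_new_per_sec = max_new_per_sec  # hypothetical tunable
        self.entries = {}                       # MAC address -> port
        self._window_start = 0.0
        self._learned_in_window = 0

    def learn(self, mac: str, port: int, now: float) -> bool:
        """Record that `mac` was seen on `port`; return False if refused."""
        if mac in self.entries:
            # Known address: refreshing or moving it adds no new entry.
            self.entries[mac] = port
            return True
        if now - self._window_start >= 1.0:
            # Start a new one-second accounting window.
            self._window_start = now
            self._learned_in_window = 0
        if len(self.entries) >= self.max_entries:
            return False    # table full: do not learn
        if self._learned_in_window >= self.max_new_per_sec:
            return False    # learning too fast: do not learn
        self.entries[mac] = port
        self._learned_in_window += 1
        return True


table = LearningTable(max_entries=3, max_new_per_sec=2)
results = [
    table.learn("00:00:00:00:00:0a", 1, now=0.0),  # learned
    table.learn("00:00:00:00:00:0b", 1, now=0.1),  # learned
    table.learn("00:00:00:00:00:0c", 2, now=0.2),  # refused: rate limit
    table.learn("00:00:00:00:00:0c", 2, now=1.5),  # learned: new window
    table.learn("00:00:00:00:00:0d", 2, now=1.6),  # refused: table full
]
print(results)
```

        In this sketch a refused address simply stays unknown, so
        frames addressed to it would be flooded like those for any
        other unlearned destination, and a flood of random source
        addresses cannot grow the table without bound.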

EB-2    Some things like a sanity check on moving one node from one
        port of the bridge to the other, limiting the rate of "newly
        discovered" nodes for inclusion in the forwarding table
        (probably might be limited at the Crossbow level), etc.

        [W]hile there are a lot of legitimate reasons for one MAC
        address to move sides of a bridge, this will not happen _that_
        often. So what I wanted to ask (rather than say here) is
        whether there is a way to guarantee (at least to a reasonable
        degree) that a move is legit... Like the bridge should see "n"
        (n less than give up on retransmit) targeted communications
        from a node without any sign of life from the other side in
        order to move the side.  And that is only for active nodes
        ... maybe

REPLY:  DEFER

        That sounds like a future project to me.  At least I don't
        think I understand the scope of what you're talking about.

        If we don't switch "immediately," then all sorts of things
        won't work right.  For instance, a link failure or recovery
        within the network can easily cause all of the nodes that once
        appeared to be on one link to appear to be on another link.
        Failing to update those learned entries on such a switch means
        that we're actively compounding the problem by making recovery
        from the situation take longer.

        And we're doing it without actually improving much about
        security, as an attacker could easily just send messages
        periodically in an attempt to mess things up.  And (of course)
        if you're on a network with a repeater/hub, then you have no
        such protection at all, so this is really an existing unsolved
        problem.

        As an alternative, I suspect that it would be possible to
        enter a "hold-down" mode when you see an entry switch back and
        forth multiple times in a short period of time.  I think
        that'd require some more research, though, to get right.

        I'll add notes about this to the section on futures.
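
        One possible shape for such a "hold-down" is sketched below
        in Python.  This is speculative and not part of the design:
        the class, the thresholds, and the policy of freezing an
        entry on its current port are all assumptions made for
        illustration.  Moves are applied immediately, as argued
        above, and only an address that flips repeatedly within a
        short window is pinned for a while.

```python
class FlapGuard:
    """Toy model: accept moves at once, hold down entries that flap."""

    def __init__(self, max_moves: int, window: float, hold_time: float):
        self.max_moves = max_moves    # moves tolerated within `window`
        self.window = window          # seconds
        self.hold_time = hold_time    # seconds an entry stays frozen
        self.port = {}                # MAC address -> current port
        self.moves = {}               # MAC address -> recent move times
        self.held_until = {}          # MAC address -> end of hold-down

    def observe(self, mac: str, port: int, now: float) -> int:
        """Note that `mac` was seen on `port`; return the port to use."""
        current = self.port.get(mac)
        if current is None:
            self.port[mac] = port     # first sighting: just learn it
            return port
        if port == current:
            return current            # no move
        if now < self.held_until.get(mac, 0.0):
            return current            # in hold-down: ignore the move
        recent = [t for t in self.moves.get(mac, []) if now - t < self.window]
        recent.append(now)
        self.moves[mac] = recent
        self.port[mac] = port         # accept the move immediately
        if len(recent) >= self.max_moves:
            self.held_until[mac] = now + self.hold_time
        return port


guard = FlapGuard(max_moves=3, window=1.0, hold_time=5.0)
seen = [
    guard.observe("00:00:00:00:00:0a", 1, now=0.0),  # learned on port 1
    guard.observe("00:00:00:00:00:0a", 2, now=0.1),  # move 1
    guard.observe("00:00:00:00:00:0a", 1, now=0.2),  # move 2
    guard.observe("00:00:00:00:00:0a", 2, now=0.3),  # move 3: hold-down
    guard.observe("00:00:00:00:00:0a", 1, now=0.4),  # ignored while held
    guard.observe("00:00:00:00:00:0a", 1, now=6.0),  # hold expired
]
print(seen)
```

        A single move, such as the one caused by a link failure or
        recovery, is never delayed in this sketch.  Whether freezing
        on the last port seen is the right behavior during hold-down
        is exactly the kind of question that needs more research.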

EB-3    I am not thinking only on the "amusement" side effect but when
        combined with VLANs it may bring "practical" effects.

REPLY:  EXPLAIN

        VLANs are addressed in the document.  The design document
        describes independent VLAN learning, so one VLAN doesn't
        affect how another operates.
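
        Independent VLAN learning can be pictured as a forwarding
        table keyed on the (VLAN ID, MAC address) pair rather than on
        the MAC address alone.  The Python sketch below is
        illustrative only; the class and method names are invented
        and do not come from the bridge code.

```python
class IvlTable:
    """Toy forwarding table with independent VLAN learning."""

    def __init__(self):
        self._ports = {}    # (VLAN ID, MAC address) -> port

    def learn(self, vid: int, mac: str, port: int) -> None:
        self._ports[(vid, mac)] = port

    def lookup(self, vid: int, mac: str):
        """Return the learned port, or None if unknown in this VLAN."""
        return self._ports.get((vid, mac))


ivl = IvlTable()
ivl.learn(10, "00:00:00:00:00:0a", 1)
ivl.learn(20, "00:00:00:00:00:0a", 2)    # same MAC, different VLAN

print(ivl.lookup(10, "00:00:00:00:00:0a"))
print(ivl.lookup(20, "00:00:00:00:00:0a"))
print(ivl.lookup(30, "00:00:00:00:00:0a"))
```

        Because the VLAN ID is part of the key, an address learned or
        moved in one VLAN leaves the entries for every other VLAN
        untouched.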

DB-1    Can this be used to bridge IPoIB or EoIB and Eth?  On the IB
        side, the connectivity could easily behave like a hub, but
        might be enhanced to behave like a switch.

REPLY:  DEFER

        As long as it looks like an Ethernet mac in GLDv3, I'd assume
        yes.  I don't have access to any test hardware to try it out,
        though, and it doesn't seem that OpenSolaris currently
        supports EoIB.  I'll test when possible.

        (The expectation would be that IPoIB won't work, as that's
        clearly not Ethernet, and only Ethernet can be bridged via the
        802 mechanisms we're using.  EoIB should work, and anything
        else will be treated as a bug.)

ND-1    We already talked about having more common MAC hooks that
        would allow consumers like bridging and L2 filtering to
        intercept and inject packets without having to change the
        common data-path. I'll start defining those soon.

REPLY:  OK

ND-2    For the RX path you need to handle the poll path as well or
        you will miss packets, see mac_rx_srs_poll_ring(). I didn't
        see that covered in your design doc or code. mac_rx() covers
        the interrupt path only.

REPLY:  ACCEPT

        After much discussion (and help from Venu), we've settled on
        creating an internal mac_poll_state_change() function that
        sets/clears MIS_POLL_DISABLE in mi_state_flags, and then calls
        mac_client_update_classifier() for each of the client streams.
        The new function will be called from mac_bridge_set() and
        mac_bridge_clear().  This will disable poll mode when bridging
        is enabled.

ND-3    BTW this is another example why generic MAC hooks would be
        useful to have, i.e. you wouldn't have to deal with all the
        details of the data-path.

REPLY:  OK

ND-4    For the TX path we're going to modify the TX default
        processing to use a fanout of rings instead of a single
        default TX ring to better scale. So some of the changes you
        are making related to the TX entry points would have to be
        redone. It would be better to not have the callers specify the
        default TX ring.

REPLY:  OK

        Agreed; I think the simpler way would be to have a single
        mac_tx function that knows how to deal with rings when
        necessary.

SR-1    [Regarding discussion on CZ-3.]

        There seems to be some confusion.  /dev/net contains
        full-fledged DLPI nodes for datalinks.  /dev/ipnet contains
        observability DLPI nodes for IP interfaces.  I think Cathy
        mentioned /dev/net, but your answer seems as if you thought
        she mentioned /dev/ipnet...

REPLY:  ACCEPT

        Oops.  You're quite right.  The /dev/net/ name confuses me
        because we use "dl" everywhere else.  :-/

        We're using /dev/net/ for observability and ioctl access to
        the bridging feature in the kernel.

-- 
James Carlson, Solaris Networking
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677