[Note intentional redirect to networking-discuss; I'm using this fork of the thread for design details rather than the system architecture.]
Nicolas Droux writes:
> I have included below a proposal which would allow bridging to
> cleanly integrate with mac without the introduction of special
> cases in the common data-path, and enabling bridging to
> leverage the layer-2 classification that was introduced
> by Crossbow. The proposal takes into account the issues we
> discussed earlier in this thread.

There's one slightly higher-level design point I'd like to get out of the way. There's an important distinction that we haven't been drawing here that I think we need to include. (And that, as I was drafting this message, I saw that Sebastien actually mentioned on the PSARC list. ;-})

The distinction is between the administrative model (the commands and behaviors that users expect from the system) and the actual internal design.

I think we have at least rough agreement on the idea that it would be good for the administrative model introduced by Crossbow to apply to traffic flowing through bridges. (If you want to get into detail here, I'd actually prefer to see a much more traditional approach, with distinct output queues, output queuing strategies, and packet-marking classifiers, rather than the combined flow+properties model used by Crossbow. I understand that the flow model is in some ways simpler, but I think it's potentially much less familiar, and likely harder to integrate with other features. But perhaps this is too far off topic.)

Where we don't necessarily have agreement is the internal design bits that allow us to provide that model. I suspect that there are *many* different ways to achieve this, and picking the best way is unclear.
For instance, instead of passing the packets from bridging through the Crossbow classifier as the mechanism for forwarding, we could have the bridge forwarding function call the classifier in order to identify the applicable flow in the cases where we're going back out through another interface, do the accounting as necessary, and then complete the forwarding using the bridge's existing forwarding tables. Doing that would avoid the potentially massive amount of data duplication required by a Crossbow-based forwarding function.

> - per bridge table
>
>     - based on existing bridging implementation, captures MAC
>       addresses associated with ports, time stamps needed
>       to track lifetime of entries, etc.
>     - table could be a flow table, but this is not the main
>       goal of this proposal.

The per-bridge table that I need and that you're describing is identically the forwarding table. It sounds like you're suggesting that it's something else, and that the L2 forwarding is done elsewhere (Crossbow classification?). Is that correct? If so, then I need to understand how the two parts would work together, and what duplication (if any) is introduced here.

Today, what I have is a fairly simple AVL tree indexed on MAC address. When any packet is seen (promiscuous mode required by standards), the MAC source address is checked in the tree. An entry is created if one is not there, and the entry is updated to indicate that the port that just received the packet is the right output for packets to be sent to that destination, and the entry is timestamped. This is the "learning" function.

Next, the packet is looked up in the same AVL tree using the MAC destination. If an entry is found, then we send to the port it indicates. If no entry is found, then we send to all ports except the input. This is the "forwarding" function.
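
To make the learning and forwarding functions just described concrete, here is a minimal sketch. It is an illustration only, not the actual bridge code: a Python dict stands in for the AVL tree, and the names `Bridge`, `learn`, `forward`, and `receive` are hypothetical.

```python
import time


class Bridge:
    """Toy learning bridge; a dict stands in for the AVL tree keyed on MAC."""

    def __init__(self, ports):
        self.ports = list(ports)
        # MAC address -> (output port, timestamp of last sighting)
        self.table = {}

    def learn(self, src_mac, in_port, now=None):
        # "Learning": create or update the entry so that packets destined
        # for src_mac go out the port that just received this packet,
        # and timestamp the entry.
        stamp = time.monotonic() if now is None else now
        self.table[src_mac] = (in_port, stamp)

    def forward(self, dst_mac, in_port):
        # "Forwarding": a known destination goes to the port its entry
        # indicates; an unknown one goes to all ports except the input.
        entry = self.table.get(dst_mac)
        if entry is not None:
            return [entry[0]]
        return [p for p in self.ports if p != in_port]

    def receive(self, src_mac, dst_mac, in_port, now=None):
        # Every packet seen is first learned from, then forwarded.
        self.learn(src_mac, in_port, now)
        return self.forward(dst_mac, in_port)
```

The point of the sketch is that a single table of M entries serves both functions: the source address drives the update and the destination address drives the lookup.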
If I understand this new proposal, I would still have my AVL tree, but the "learning" function would be changed so that if the learned port appears to change, then we must delete N classifier entries (one for each port on the bridge) for the old port (if any), then create N separate classifier entries for the new output port. Presumably, in the normal steady-state case, there'd be only a few of these.

The "forwarding" function would be changed so that the classifier does the work for known destinations (likely reentering the bridge code so that we can check STP's forwarding flag as required) and unknown destinations reenter the bridge code to be replicated as before.

Where before I would have M forwarding entries in the bridging code, I would now have M learn-only entries in my private per-bridge AVL tree plus M*N classifier entries in Crossbow.

The unfortunate matter here is that M and N are potentially quite large in some cases. 'M' is the number of MAC addresses we learn, which is a function of network size, and is commonly in the hundreds. 'N' is the number of ports, which could be driven up by the number of etherstubs or other virtual entities used. A dozen is not unreasonable, but larger numbers are possible. The result would be anywhere from a few thousand classifier entries to perhaps a million or so.

Do I have the new proposal right?

> - new promiscuous callback flag, synchronous
>
>     - bridge registers promiscuous callback with mac,
>       specifies (new) sync flag
>     - mac calls the registered callback with original packet
>       (read-only)
>     - bridge inspects packet to extract MAC addresses
>     - bridge updates its table according to MAC address
>     - bridge calls mac to add entries to the layer-2 flow tables
>       of the ports associated with the bridge.

Yes, that sounds like what I described above.
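
As a rough illustration of that bookkeeping, the sketch below models a learn-only table plus one classifier table per port, and counts the entries and updates that result. It reflects only the reading of the proposal given above; the class and method names are hypothetical and none of this is Crossbow's actual flow API.

```python
class ProposedBridge:
    """Toy model: a learn-only table plus one classifier table per port."""

    def __init__(self, ports):
        self.ports = list(ports)
        # Private per-bridge table: MAC -> learned output port (M entries).
        self.learned = {}
        # One layer-2 classifier table per port: MAC -> output port.
        self.flows = {p: {} for p in self.ports}

    def learn(self, src_mac, in_port):
        """Learn from one packet; return the classifier updates it caused."""
        old = self.learned.get(src_mac)
        if old == in_port:
            return 0  # steady state: the learned port did not change
        updates = 0
        for table in self.flows.values():
            if old is not None:
                del table[src_mac]  # delete the entry for the old port
                updates += 1
            table[src_mac] = in_port  # create the entry for the new port
            updates += 1
        self.learned[src_mac] = in_port
        return updates

    def classifier_entries(self):
        # Total across all ports: M * N once every address is learned.
        return sum(len(t) for t in self.flows.values())
```

With 300 learned addresses on a 12-port bridge this holds 3600 classifier entries, in line with the "few thousand" figure, and moving a single address to a different port costs 24 updates (12 deletes and 12 adds). Reaching a million entries would take much larger values, for example 10,000 addresses across 100 ports.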
Note that the callback must occur on both transmit and receive in order for correctness to be preserved: raw DLPI clients can actually use source MAC addresses that differ from the one configured on the link.

For what it's worth, my current design replaces mac_ring_tx with a function that uses a very simple check. If there's a bridge, it takes a reference and calls into the bridge logic to handle output, which ultimately calls back into mac through the (renamed) original mac_ring_tx logic. If there's no bridge, then it does a (tail-call optimized) call directly into the original logic. The code involves an extra 8 instructions on SPARC. I can probably pare that down a bit. Similar bits are in mac_rx. (And I had to move some of the duplicated logic in the #defines into common bits.)

It sounds like the extra callback entry will likely have comparable overhead in the bridging case, but _if_ this new callback is just a new MAC client (of some sort) rather than a special case callback, it might have lower overhead in the non-bridge case.

> - updates to the MAC layer-2 flows
>
>     - entries are added and removed to and from the layer-2
>       classification table of a MAC instance (port) via a flow
>       API provided by mac to the bridging code.
>     - updates to these MAC classification tables are done
>       when addresses are added to/removed from the bridge table
>     - The callback function is a bridge processing function,
>       the cookie points to the bridge table entry for the
>       MAC address
>     - The flow addition/deletion API can be restructured
>       to no longer require the MAC perimeter to be held
>       when flow entries are added and deleted by the bridge.
>       This would allow the updates to be done without the
>       help of a worker thread, and would scale better with
>       frequent updates.
>     - The flow API can be extended to allow a set of flows
>       to be removed with a single call. A MAC client handle
>       and cookie could be used to identify the set of
>       flows to be removed.
What I actually need to do is to iterate over the learned entries and then either (a) flush only those that are over a given age or (b) delete "all" [per link, per TRILL nickname, or for the whole bridge] on demand from STP or TRILL. I don't believe there's a reason to remove sets other than those.

The API restructuring to change the locking sounds significant. Has much work been done in this area?

> - TX/RX data path
>
>     - does not require special processing on data-path
>     - RX and TX classification causes the MAC addresses
>       registered by the bridge to be matched against
>       the destination MAC address of sent and received packets
>     - packets are passed to bridge callback along with
>       registered cookie
>     - bridge gets packet, and knows destination to forward packet
>       to appropriate destination associated with the MAC address

I assume that "knows destination" here means that the already matched classifier entry has a pointer that tells us which output to use.

>     - in the future, can takes advantage of TX fastpath
>       transparently

It's unclear to me how that could work without some substantial changes.

Here's a simple case. Suppose I have a bridge with hme0 and ce0, and I plumb up hme0 for IP. I then receive a packet from 0:1:2:3:4:5 on ce0. If I transmit a packet through IP's "hme0" interface, and ARP determines that this should go to 0:1:2:3:4:5, then the bridge actually redirects that packet out through ce0. This means that IP thinks that it's sending on hme0, and will set up all of the hardware acceleration bits as though it were sending there, but we actually go out on a completely different interface with different hardware.

One nifty way to do this would be to rip out the "capability" stuff from IP, and just have IP assume that all GLDv3-based interfaces have all possible capabilities. Then have the dls or mac layer (whatever's appropriate) emulate the necessary hardware bits for those drivers that lack the features.
But I don't know what's planned for the future here.

>     - introduce a new "no match" mac callback, invoked when there's
>       no match on a MAC address lookup. Currently this is hard-coded
>       in the data-path to send the packet on the wire, and
>       a more generic mechanism would allow a bridge to receive
>       a copy of such packets.

I'm unsure how multicast and broadcast are handled in the new design. In bridging, they normally need to be handled as "unknown" destinations. Are they always "unknown" to Crossbow's classifier? Or do they come through multiple paths?

(To make things slightly more complicated, it's possible for bridges to 'snoop' IGMP/MLD messages, and optimize multicast forwarding paths. We're not currently doing this, but it's a common feature, and it'd be difficult if the new design left this out.)

--
James Carlson, Solaris Networking <[email protected]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
_______________________________________________
networking-discuss mailing list
[email protected]
