Re: [networking-discuss] Issues for 2008/055

James Carlson Fri, 13 Feb 2009 14:04:43 -0800

[Note intentional redirect to networking-discuss; I'm using this fork
of the thread for design details rather than the system architecture.]

Nicolas Droux writes:
> I have included below a proposal which would allow bridging to
> cleanly integrate with mac without the introduction of special
> cases in the common data-path, and enabling bridging to
> leverage the layer-2 classification that was introduced
> by Crossbow. The proposal takes into account the issues we
> discussed earlier in this thread.

There's one slightly higher-level design point I'd like to get out of
the way.  There's an important distinction that we haven't been
drawing here that I think we need to include.  (And that, as I was
drafting this message, I saw that Sebastien actually mentioned on the
PSARC list.  ;-})

The distinction is between the administrative model (the commands and
behaviors that users expect from the system) and the actual internal
design.

I think we have at least rough agreement on the idea that it would be
good for the administrative model introduced by Crossbow to apply to
traffic flowing through bridges.  (If you want to get into detail
here, I'd actually prefer to see a much more traditional approach,
with distinct output queues, output queuing strategies, and
packet-marking classifiers, rather than the combined flow+properties
model used by Crossbow.  I understand that the flow model is in some
ways simpler, but I think it's potentially much less familiar, and
likely harder to integrate with other features.  But perhaps this is
too far off topic.)

Where we don't necessarily have agreement is the internal design bits
that allow us to provide that model.  I suspect that there are *many*
different ways to achieve this, and it's unclear which one is best.
For instance, instead of passing the packets from bridging through the
Crossbow classifier as the mechanism for forwarding, we could have the
bridge forwarding function call the classifier in order to identify
the applicable flow in the cases where we're going back out through
another interface, do the accounting as necessary, and then complete
the forwarding using the bridge's existing forwarding tables.

Doing that would avoid the potentially massive amount of data
duplication required by a Crossbow-based forwarding function.

> - per bridge table
> 
>       - based on existing bridging implementation, captures MAC
>         addresses associated with ports, time stamps needed
>         to track lifetime of entries, etc.
>       - table could be a flow table, but this is not the main
>         goal of this proposal.

The per-bridge table that I need and that you're describing is
identically the forwarding table.  It sounds like you're suggesting
that it's something else, and that the L2 forwarding is done elsewhere
(Crossbow classification?).  Is that correct?

If so, then I need to understand how the two parts would work
together, and what duplication (if any) is introduced here.

Today, what I have is a fairly simple AVL tree indexed on MAC address.
When any packet is seen (promiscuous mode required by standards), the
MAC source address is checked in the tree.  An entry is created if one
is not there, and the entry is updated to indicate that the port that
just received the packet is the right output for packets to be sent to
that destination, and the entry is timestamped.  This is the
"learning" function.

Next, the packet is looked up in the same AVL tree using the MAC
destination.  If an entry is found, then we send to the port it
indicates.  If no entry is found, then we send to all ports except the
input.  This is the "forwarding" function.
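
As a rough illustration of the learning and forwarding functions just
described, here is a minimal Python sketch.  It is a model, not the
kernel code: a dict stands in for the AVL tree, and the names
(LearningBridge, learn, forward, receive) are made up for the example.
STP state checks are left out.

```python
import time


class LearningBridge:
    """Toy model of the learning and forwarding functions (hypothetical names)."""

    def __init__(self, ports):
        self.ports = list(ports)
        # MAC address -> (output port, timestamp).
        # A dict stands in for the AVL tree indexed on MAC address.
        self.table = {}

    def learn(self, src_mac, in_port, now=None):
        # Create or update the entry: the receiving port is the right
        # output for packets sent to src_mac, and the entry is timestamped.
        stamp = time.monotonic() if now is None else now
        self.table[src_mac] = (in_port, stamp)

    def forward(self, dst_mac, in_port):
        # Known destination: send to the port the entry indicates.
        entry = self.table.get(dst_mac)
        if entry is not None:
            return [entry[0]]
        # Unknown destination: send to all ports except the input.
        return [p for p in self.ports if p != in_port]

    def receive(self, src_mac, dst_mac, in_port, now=None):
        # Every packet seen is learned from first, then forwarded.
        self.learn(src_mac, in_port, now)
        return self.forward(dst_mac, in_port)
```

For example, the first packet from an unknown source is flooded, and
the reply then goes only to the learned port:

```python
b = LearningBridge(["hme0", "ce0", "bge0"])
b.receive("0:1:2:3:4:5", "a:b:c:d:e:f", "ce0")   # flooded to hme0 and bge0
b.receive("a:b:c:d:e:f", "0:1:2:3:4:5", "hme0")  # sent to ce0 only
```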

If I understand this new proposal, I would still have my AVL tree, but
the "learning" function would be changed so that if the learned port
appears to change, then we must delete N classifier entries (one for
each port on the bridge) for the old port (if any), then create N
separate classifier entries for the new output port.  Presumably, in
the normal steady-state case, there'd be only a few of these.

The "forwarding" function would be changed so that the classifier does
the work for known destinations (likely reentering the bridge code so
that we can check STP's forwarding flag as required) and unknown
destinations reenter the bridge code to be replicated as before.

Where before I would have M forwarding entries in the bridging code, I
would now have M learn-only entries in my private per-bridge AVL tree
plus M*N classifier entries in Crossbow.  The unfortunate matter here
is that M and N are potentially quite large in some cases.  'M' is the
number of MAC addresses we learn, which is a function of network size,
and is commonly in the hundreds.  'N' is the number of ports, which
could be driven up by the number of etherstubs or other virtual
entities used.  A dozen is not unreasonable, but larger numbers are
possible.  The result is anywhere from a few thousand classifier
entries to perhaps a million or so.
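
To make the arithmetic concrete, here is a small Python sketch of the
entry counts as I understand the proposal.  The function names and the
sample values of M and N are my own assumptions, chosen only to span
the range mentioned above.

```python
def classifier_entries(num_macs, num_ports):
    """Entries needed if each learned MAC is installed on every port (M*N)."""
    return num_macs * num_ports


def relearn_cost(num_ports):
    """Classifier updates when one learned MAC moves to a different port:
    N deletes for the old port plus N creates for the new one."""
    return {"deletes": num_ports, "creates": num_ports}


# Sample sizes are illustrative assumptions, not measurements.
for m, n in [(200, 12), (500, 12), (5000, 200)]:
    total = classifier_entries(m, n)
    print(f"M={m} N={n}: {m} AVL entries, {total} classifier entries")
```

With a couple hundred addresses and a dozen ports this comes to a few
thousand entries; the last row shows how a large virtualized
configuration could reach a million.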

Do I have the new proposal right?

> - new promiscuous callback flag, synchronous
>       
>       - bridge registers promiscuous callback with mac,
>         specifies (new) sync flag
>       - mac calls the registered callback with original packet
>         (read-only)
>       - bridge inspects packet to extract MAC addresses
>       - bridge updates its table according to MAC address
>       - bridge calls mac to add entries to the layer-2 flow tables
>         of the ports associated with the bridge.

Yes, that sounds like what I described above.  Note that the callback
must occur on both transmit and receive in order for correctness to be
preserved: raw DLPI clients can actually use source MAC addresses that
differ from the one configured on the link.

For what it's worth, my current design replaces mac_ring_tx with a
function that uses a very simple check.  If there's a bridge, it takes
a reference and calls into the bridge logic to handle output, which
ultimately calls back into mac through the (renamed) original
mac_ring_tx logic.  If there's no bridge, then it does a (tail-call
optimized) call directly into the original logic.  The code involves
an extra 8 instructions on SPARC.  I can probably pare that down a
bit.  Similar bits are in mac_rx.  (And I had to move some of the
duplicated logic in the #defines into common bits.)

It sounds like the extra callback entry will likely have comparable
overhead in the bridging case, but _if_ this new callback is just a
new MAC client (of some sort) rather than a special case callback, it
might have lower overhead in the non-bridge case.
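
The shape of that transmit check can be modeled roughly as follows.
This is a Python sketch of the control flow only, not the actual C
code; Bridge, make_tx, bridge_for, and the reference counter are
hypothetical names used for illustration.

```python
class Bridge:
    """Stand-in for the bridge logic (hypothetical)."""

    def __init__(self):
        self.refs = 0
        self.seen = []

    def output(self, link, packet, lower_tx):
        # The bridge decides where the packet goes.  Here it only records
        # the packet and calls back into the original transmit logic.
        self.seen.append(packet)
        return lower_tx(link, packet)


def make_tx(original_tx, bridge_for):
    """Wrap the original transmit routine with a simple bridge check."""

    def tx(link, packet):
        bridge = bridge_for(link)
        if bridge is None:
            # No bridge: call directly into the original logic.
            return original_tx(link, packet)
        # Bridge present: take a reference, let the bridge handle output,
        # and drop the reference when done.
        bridge.refs += 1
        try:
            return bridge.output(link, packet, original_tx)
        finally:
            bridge.refs -= 1

    return tx
```

The non-bridge path costs one lookup and one branch, which is the part
that matters for links that never join a bridge.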

> - updates to the MAC layer-2 flows
> 
>       - entries are added and removed to and from the layer-2
>         classification table of a MAC instance (port) via a flow
>         API provided by mac to the bridging code.
>       - updates to these MAC classification tables are done
>         when addresses are added to/removed from the bridge table
>       - The callback function is a bridge processing function,
>         the cookie points to the bridge table entry for the
>         MAC address
>       - The flow addition/deletion API can be restructured
>         to no longer require the MAC perimeter to be held
>         when flow entries are added and deleted by the bridge.
>         This would allow the updates to be done without the
>         help of a worker thread, and would scale better with
>         frequent updates.
>       - The flow API can be extended to allow a set of flows
>         to be removed with a single call. A MAC client handle
>         and cookie could be used to identify the set of
>         flows to be removed.

What I actually need to do is to iterate over the learned entries and
then either (a) flush only those that are over a given age or (b)
delete "all" [per link, per TRILL nickname, or for the whole bridge]
on demand from STP or TRILL.  I don't believe there's a reason to
remove sets other than those.
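
A minimal Python sketch of those two removal operations, assuming each
learned entry records its link, an optional TRILL nickname, and a
timestamp.  The field and function names are hypothetical.

```python
def flush_aged(table, now, max_age):
    """(a) Remove entries older than max_age; return the removed MACs."""
    old = [mac for mac, e in table.items() if now - e["stamp"] > max_age]
    for mac in old:
        del table[mac]
    return old


def flush_matching(table, link=None, nickname=None):
    """(b) Remove all entries for a link, for a TRILL nickname, or
    (with no filter given) for the whole bridge; return the removed MACs."""
    gone = [mac for mac, e in table.items()
            if (link is None or e["link"] == link)
            and (nickname is None or e["nickname"] == nickname)]
    for mac in gone:
        del table[mac]
    return gone
```

Both walk the bridge's own table, so whatever the classifier holds
would have to be removed by the same keys.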

The API restructuring to change the locking sounds significant.  Has
much work been done in this area?

> - TX/RX data path
> 
>       - does not require special processing on data-path
>       - RX and TX classification causes the MAC addresses
>         registered by the bridge to be matched against
>         the destination MAC address of sent and received packets
>       - packets are passed to bridge callback along with
>         registered cookie
>       - bridge gets packet, and knows destination to forward packet
>         to appropriate destination associated with the MAC address

I assume that "knows destination" here means that the already matched
classifier entry has a pointer that tells us which output to use.

>       - in the future, can takes advantage of TX fastpath
>         transparently

It's unclear to me how that could work without some substantial
changes.  Here's a simple case.  Suppose I have a bridge with hme0 and
ce0, and I plumb up hme0 for IP.  I then receive a packet from
0:1:2:3:4:5 on ce0.

If I transmit a packet through IP's "hme0" interface, and ARP
determines that this should go to 0:1:2:3:4:5, then the bridge
actually redirects that packet out through ce0.

This means that IP thinks that it's sending on hme0, and will set up
all of the hardware acceleration bits as though it were sending there,
but we actually go out on a completely different interface with
different hardware.

One nifty way to do this would be to rip out the "capability" stuff
from IP, and just have IP assume that all GLDv3-based interfaces have
all possible capabilities.  Then have the dls or mac layer (whatever's
appropriate) emulate the necessary hardware bits for those drivers
that lack the features.

But I don't know what's planned for the future here.

>       - introduce a new "no match" mac callback, invoked when there's
>         no match on a MAC address lookup. Currently this is hard-coded
>         in the data-path to send the packet on the wire, and
>         a more generic mechanism would allow a bridge to receive
>         a copy of such packets.

I'm unsure how multicast and broadcast are handled in the new design.
In bridging, they normally need to be handled as "unknown"
destinations.  Are they always "unknown" to Crossbow's classifier?  Or
do they come through multiple paths?

(To make things slightly more complicated, it's possible for bridges
to 'snoop' IGMP/MLD messages, and optimize multicast forwarding paths.
We're not currently doing this, but it's a common feature, and it'd be
difficult if the new design left this out.)

-- 
James Carlson, Solaris Networking              <[email protected]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677