> On 30 Jul 2019, at 6:28 pm, Remi Locherer <remi.loche...@relo.ch> wrote:
> 
> On Tue, Jul 30, 2019 at 01:36:59PM +1000, David Gwynne wrote:
>> a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
>> two ports, and unconditionally relays packets between those ports
>> instead of doing learning or anything like that.
>> 
>> i've been trying to get a redundant pair of bridges set up between two
>> datacenters here to help me while i migrate between them. so far all my
>> efforts to make it redundant have mostly worked, until they introduced
>> loops in the layer 2 topology, which generates a broadcast storm, which
>> basically takes the net down for a few minutes at a time. it feels
>> very betraying.
>> 
>> my frustration is that switches plugged together have mechanisms to
>> prevent loops like that, more specifically they use spanning tree or
>> lacp to make appropriate use of redundant links. i got to a point where
>> i just wanted the switches to talk to each other and do their own thing
>> to negotiate use of the redundant links.
>> 
>> unfortunately the only way to get ethernet packets off a physical
>> wire and onto a tunnel over an ip network is bridge(4), and bridge(4)
>> tries to be a compliant switch from a standards point of view. this
>> means it intercepts packets that are meant to be processed by bridges,
>> because it is a bridge. these types of packets include spanning tree and
>> lacp, which means i couldnt get the physical switches at each site to
>> talk to each other. sadface.
>> 
>> so to solve my problem i hacked up a small driver that did less than
>> bridge(4). however, it turns out that what i hacked up is an actual
>> thing that already exists as something done in the real world. IEEE
>> 802.1Q describes TPMR, which is defined as intercepting far less
>> than a real bridge does. one of the appendices specifically describes
>> lacp going through one, which is exactly what i wanted. cisco does
>> something like this with their layer 2 cross-connects (search for cisco
>> xconnect for examples), juniper has l2circuits, and so on.
>> 
>> the way i'm using this is like below. i have a pair of bridges in each
>> datacenter, so 4 boxes in total. they peer directly with the ip network
>> that sits between the datacenters. each box has 4 physical network
>> ports. 2 of those ports are configured with aggr(4) and talk IP into the
>> core network. the other two ports are connected to the switches at
>> each site for use with tpmr. there's 2 etherip interfaces configured on
>> each physical box, each of which is connected to the tpmr.
>> 
>> all that together looks a bit like the following:
>> 
>> +-+ +--------------------------+      +---------------------------+ +-+
>> |d|-|ix2 <-> tpmr0 <-> etherip0|------|etherip0 <-> tpmr0 <-> ixl0|-|d|
>> |c| |                          |      |                           | |c|
>> |0|-|ix3 <-> tpmr1 <-> etherip1|-    -|etherip1 <-> tpmr1 <-> ixl1|-|1|
>> ||| +--------------------------+ \  / +---------------------------+ |||
>> |s|         dc0-bridge0           \/          dc1-bridge0           |s|
>> |w|                               /\                                |w|
>> |i| +--------------------------+ /  \ +---------------------------+ |i|
>> |t|-|ix2 <-> tpmr0 <-> etherip0|-    -|etherip0 <-> tpmr0 <-> ixl0|-|t|
>> |c| |                          |      |                           | |c|
>> |h|-|ix3 <-> tpmr1 <-> etherip1|------|etherip1 <-> tpmr1 <-> ixl1|-|h|
>> +-+ +--------------------------+      +---------------------------+ +-+
>>             dc0-bridge1                       dc1-bridge1
>> 
>> each switch has a 4 port port-channel (lacp aggregation) set up. because
>> each physical interface on the bridges is tied to a single tunnel, the
>> packets effectively traverse a point-to-point link, ie, a really
>> complicated wire. because lacp makes it from each point to the other
>> point, the switches make sure only active lacp ports are used, which
>> avoids layer 2 loops. lacp also means i get to use all the links when
>> theyre available.
>> 
>> with the topology above i can lose a bridge at each site and should
>> still have a working link to the other side, so i get my redundancy. the
>> use of the extra links with lacp is a bonus. at this point i would have
>> been happy for spanning tree to shut links down.
>> 
>> anyway, here's the code.
>> 
>> it was originally called xcon(4) since it provides a software
>> cross-connect, but i changed my mind after looking at 802.1Q. it might
>> be unfair to refer to 802.1Q because tpmr(4) does none of the filtering
>> that the spec says it should. i just needed it to work though.
>> 
>> the guts of it is tpmr_input(). it basically gets the rxed packet from
>> one port and enqueues it for transmission immediately on the other port.
>> it does run bpf though, and supports filtering on bpf, which has been
>> handy for us when we needed to test taking bpdus off the wire for a bit.
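
The relay step described above for tpmr_input() can be illustrated with a small userland sketch. This is not the driver code from the diff: the names relay_input(), struct relay, struct port, struct pkt and the drop_bpdu() hook are all hypothetical, the fixed-size array stands in for a real transmit queue, and the filter callback is only a stand-in for the bpf filtering mentioned above. It assumes a frame is a plain byte buffer whose first 6 bytes are the destination MAC address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPORTS	2	/* a two-port relay has exactly two ports */
#define QLEN	8	/* toy transmit queue depth (assumption) */

/* hypothetical frame: a pointer to the raw bytes plus a length */
struct pkt {
	const unsigned char *data;
	size_t len;
};

/* hypothetical port: just a small transmit queue */
struct port {
	struct pkt txq[QLEN];
	size_t txlen;
};

/* hypothetical filter hook standing in for bpf; return true to drop */
typedef bool (*filter_fn)(const struct pkt *);

struct relay {
	struct port ports[NPORTS];
	filter_fn filter;	/* may be NULL, meaning relay everything */
};

/*
 * Take a frame received on rxport and enqueue it on the other port.
 * There is no learning and no address lookup: the output port is
 * always "the one the frame did not arrive on".
 * Returns true if the frame was queued, false if it was dropped.
 */
static bool
relay_input(struct relay *r, int rxport, const struct pkt *p)
{
	struct port *out;

	assert(r != NULL && p != NULL);

	if (rxport < 0 || rxport >= NPORTS)
		return false;
	if (r->filter != NULL && r->filter(p))
		return false;		/* filter asked for a drop */

	out = &r->ports[1 - rxport];	/* the peer port */
	if (out->txlen >= QLEN)
		return false;		/* queue full */

	out->txq[out->txlen++] = *p;
	return true;
}

/* example filter: drop spanning tree BPDUs by destination address */
static bool
drop_bpdu(const struct pkt *p)
{
	static const unsigned char stp[6] =
	    { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x00 };

	return p->len >= 6 && memcmp(p->data, stp, 6) == 0;
}
```

With no filter installed, a frame addressed to the slow protocols address 01:80:c2:00:00:02 (used by lacp) that arrives on port 0 ends up on port 1's queue untouched, which is the behaviour the switches rely on. Installing drop_bpdu as the filter mimics taking bpdus off the wire while everything else is still relayed.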
>> 
>> because it does such a small amount of work, it is relatively fast.
>> hrvoje popovski has given it a quick spin and seen the following
>> results on a fast box with a pair of ix(4) interfaces:
>> 
>> plain ip forwarding: 1.5Mpps
>> bridge(4) under load from 14Mpps: 500Kpps
>> bridge(4) under load from 1Mpps: 800Kpps
>> tpmr(4): 1.75Mpps
>> 
>> 1.75Mpps was lower than I was expecting, but it turns out he was hitting
>> limits in other parts of the system. with some tuning we got it up to
>> 2.25Mpps. the softnet taskq was only at about 66% cpu time, but we
>> couldnt see any other obvious places that we were dropping load.
>> 
>> on a slower box that can do IP forwarding at 1Mpps, tpmr(4) can do
>> 1.6Mpps. it's worth noting that the boxes were extremely responsive (ie,
>> ssh feels fine) when tpmr is under load, which is not the case when ip
>> forwarding or bridge are being hammered.
>> 
>> my point is that it might be useful having tpmr(4) just to be able to
>> test network driver performance improvements independently of the stack.
>> im probably going to be using it to monitor links as a "bump in the
>> wire" too.
>> 
>> lastly regarding the code. i made this use the trunk(4) ioctls instead
>> of the bridge ones, mostly because i had to fake less stuff to make
>> ifconfig output look ok.
>> 
>> ifconfig output looks like this:
>> 
>> xdlg@dc3-bridge1:~$ ifconfig tpmr
>> 
>> tpmr0: flags=51<UP,POINTOPOINT,RUNNING>
>>      description: xconnect
>>      index 15 priority 0 llprio 7
>>      trunk: trunkproto none
>>              ix2 port active,collecting,distributing
>>              etherip10 port active,collecting,distributing
>>      groups: tpmr
>>      status: active
>> 
>> anyway. thoughts? ok?
> 
> Have you tried to use bridge with STP enabled in your setup? Just curious.
> I understand that with STP on the OpenBSD box you could not use all links
> and forwarding performance would not be as good.

The ports I plug into in one of the datacenters are on fabric extenders on a 
cisco nexus setup, and they don't like to do spanning tree. I ended up having 
to do LACP.

dlg

> 
> Anyway, I think tpmr would be a nice addition!
> 
> Remi
