The 07/30/2019 11:58, Nikolay Aleksandrov wrote:
> On 30/07/2019 11:30, Allan W. Nielsen wrote:
> > The 07/30/2019 10:06, Ido Schimmel wrote:
> >> On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
> >>> The 07/29/2019 20:51, Ido Schimmel wrote:
> >>>> Can you please clarify what you're trying to achieve? I just read the
> >>>> thread again and my impression is that you're trying to locally receive
> >>>> packets with a certain link layer multicast address.
> >>> Yes. The thread is also a bit confusing because we half way through 
> >>> realized
> >>> that we misunderstood how the multicast packets should be handled (sorry 
> >>> about
> >>> that). To begin with we had a driver where multicast packets was only 
> >>> copied to
> >>> the CPU if someone needed it. Andrew and Nikolay made us aware that this 
> >>> is not
> >>> how other drivers are doing it, so we changed the driver to include the 
> >>> CPU in
> >>> the default multicast flood-mask.
> >> OK, so what prevents you from removing all other ports from the
> >> flood-mask and letting the CPU handle the flooding? Then you can install
> >> software tc filters to limit the flooding.
> > I do not have the bandwidth to forward the multicast traffic in the CPU.
> > 
> > It will also cause enormous latency on the forwarding of L2 multicast 
> > packets.
> > 
> >>> This changes the objective a bit. To begin with we needed to get more 
> >>> packets to
> >>> the CPU (which could have been done using tc ingress rules and a trap 
> >>> action).
> >>>
> >>> Now after we changed the driver, we realized that we need something to 
> >>> limit the
> >>> flooding of certain L2 multicast packets. This is the new problem we are 
> >>> trying
> >>> to solve!
> >>>
> >>> Example: Say we have a bridge with 4 slave interfaces, then we want to 
> >>> install a
> >>> forwarding rule saying that packets to a given L2-multicast MAC address, 
> >>> should
> >>> only be flooded to 2 of the 4 ports.
> >>>
> >>> (instead of adding rules to get certain packets to the CPU, we are now 
> >>> adding
> >>> other rules to prevent other packets from going to the CPU and other 
> >>> ports where
> >>> they are not needed/wanted).
> >>>
> >>> This is exactly the same thing as IGMP snooping does dynamically, but 
> >>> only for
> >>> IP multicast.
> >>>
> >>> The "bridge mdb" allow users to manually/static add/del a port to a 
> >>> multicast
> >>> group, but still it operates on IP multicast address (not L2 multicast
> >>> addresses).
> >>>
> >>>> Nik suggested SIOCADDMULTI.
> >>> It is not clear to me how this should be used to limit the flooding, 
> >>> maybe we
> >>> can make some hacks, but as far as I understand the intend of this is 
> >>> maintain
> >>> the list of addresses an interface should receive. I'm not sure this 
> >>> should
> >>> influence how for forwarding decisions are being made.
> >>>
> >>>> and I suggested a tc filter to get the packet to the CPU.
> >>> The TC solution is a good solution to the original problem where wanted 
> >>> to copy
> >>> more frames to the CPU. But we were convinced that this is not the right
> >>> approach, and that the CPU by default should receive all multicast 
> >>> packets, and
> >>> we should instead try to find a way to limit the flooding of certain 
> >>> frames as
> >>> an optimization.
> >>
> >> This can still work. In Linux, ingress tc filters are executed before the
> >> bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
> >> performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
> >> you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
> >> eth2:
> >>
> >> # tc filter add dev eth0 ingress pref 1 flower skip_sw \
> >>    dst_mac 01:21:6C:00:00:01 action trap
> >> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> >>    dst_mac 01:21:6C:00:00:01 action drop
> >>
> >> The first filter is only present in HW ('skip_sw') and should result in
> >> your HW passing you the sole copy of the packet.
> > Agree.
> > 
> >> The second filter is only present in SW ('skip_hw', not using HW egress
> >> ACL that you don't have) and drops the packet after it was flooded by
> >> the SW bridge.
> > Agree.
> > 
> >> As I mentioned earlier, you can install the filter once in your HW and
> >> share it between different ports using a shared block. This means you
> >> only consume one TCAM entry.
> >>
> >> Note that this allows you to keep flooding all other multicast packets
> >> in HW.
> > Yes, but the frames we want to limit the flood-mask on are the exact frames
> > which occurs at a very high rate, and where latency is important.
> > 
> > I really do not consider it as an option to forward this in SW, when it is
> > something that can easily be offloaded in HW.
> > 
> >>>> If you now want to limit the ports to which this packet is flooded, then
> >>>> you can use tc filters in *software*:
> >>>>
> >>>> # tc qdisc add dev eth2 clsact
> >>>> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> >>>>  dst_mac 01:21:6C:00:00:01 action drop
> >>> Yes. This can work in the SW bridge.
> >>>
> >>>> If you want to forward the packet in hardware and locally receive it,
> >>>> you can chain several mirred action and then a trap action.
> >>> I'm not I fully understand how this should be done, but it does sound 
> >>> like it
> >>> becomes quite complicated. Also, as far as I understand it will mean that 
> >>> we
> >>> will be using TCAM/ACL resources to do something that could have been 
> >>> done with
> >>> a simple MAC entry.
> >>>
> >>>> Both options avoid HW egress ACLs which your design does not support.
> >>> True, but what is wrong with expanding the functionality of the normal
> >>> forwarding/MAC operations to allow multiple destinations?
> >>>
> >>> It is not an uncommon feature (I just browsed the manual of some common L2
> >>> switches and they all has this feature).
> >>>
> >>> It seems to fit nicely into the existing user-interface:
> >>>
> >>> bridge fdb add    01:21:6C:00:00:01 port eth0
> >>> bridge fdb append 01:21:6C:00:00:01 port eth1
> >>
> >> Wouldn't it be better to instead extend the MDB entries so that they are
> >> either keyed by IP or MAC? I believe FDB should remain as unicast-only.
> > 
> > You might be right, it was not clear to me which of the two would fit the
> > purpose best.
> > 
> > From a user-space iproute2 perspective I prefer using the "bridge fdb" 
> > command
> > as it already supports the needed syntax, and I do not think it will be too
> > pretty if we squeeze this into the "bridge mdb" command syntax.
> > 
> 
> MDB is a much better fit as Ido already suggested. FDB should remain unicast
> and mixing them is not a good idea, we already have a good ucast/mcast 
> separation
> and we'd like to keep it that way.
Okay. We will explore that option.


> > But that does not mean that it need to go into the FDB database in the
> > implementation.
> > 
> > Last evening when I looked at it again, I was considering keeping the
> > net_bridge_fdb_entry structure as is, and add a new hashtable with the
> > following:
> > 
> > struct net_bridge_fdbmc_entry {
> >     struct rhash_head               rhnode;
> >     struct net_bridge_fdbmc_ports   *dst;
> > 
> >     struct net_bridge_fdb_key       key;
> >     struct hlist_node               fdb_node;
> >     unsigned char                   offloaded:1;
> > 
> >     struct rcu_head                 rcu;
> > };
> > 
> 
> What would the notification for this look like ?
Not sure. But we will change the direction and use the MDB structures instead.

> > If we go with this approach then we can look at the MAC address and see if 
> > it is
> > a unicast which will cause a lookup in the fdb, l3-multicast (33:33:* or
> > 01:00:5e:*) which will cause a lookup in the mdb, or finally a fdbmc which 
> > will
> > need to do a lookup in this new hashtable.
> 
> That sounds wrong, you will change the current default behaviour of flooding 
> these
> packets. This will have to be well hidden behind a new option and enabled 
> only on user
> request.
It will only affect users who install a static L2-multicast entry. If no entry
is found, it will default to flooding, which will be the same as before.

> > Alternative it would be like this:
> > 
> > struct net_bridge_fdb_entry {
> >     struct rhash_head               rhnode;
> >     union net_bridge_port_or_list   *dst;
> > 
> >     struct net_bridge_fdb_key       key;
> >     struct hlist_node               fdb_node;
> >     unsigned char                   is_local:1,
> >                                     is_static:1,
> >                                     is_sticky:1,
> >                                     added_by_user:1,
> >                                     added_by_external_learn:1,
> >                                     offloaded:1;
> >                                     multi_dst:1;
> > 
> >     /* write-heavy members should not affect lookups */
> >     unsigned long                   updated ____cacheline_aligned_in_smp;
> >     unsigned long                   used;
> > 
> >     struct rcu_head                 rcu;
> > };
> > 
> > Both solutions should require fairly few changes, and should not cause any
> > measurable performance hit.
> > 
> 
> You'll have to convert these bits to use the proper atomic bitops if you go 
> with
> the second solution. That has to be done even today, but the second case would
> make it a must.
Good to know.

Just for my understanding, is this because this is the "current" guide lines on
how things should be done, or is this because the multi_dst as a special need.

The multi_dst flag will never be changed in the life-cycle of the structure, and
the structure is protected by rcu. If this is causeing a raise, then I do not
see it.

> > Making it fit into the net_bridge_mdb_entry seems to be harder.
> > 
> 
> But it is the correct abstraction from bridge POV, so please stop trying to 
> change
> the FDB code and try to keep to the multicast code.
We are planning on letting the net_bridge_port_or_list union use the
net_bridge_port_group structure, which will mean that we can re-use the
br_multicast_flood function (if we change the signatire to accept the ports
instead of the entry).

> >> As a bonus, existing drivers could benefit from it, as MDB entries are 
> >> already
> >> notified by MAC.
> > Not sure I follow. When FDB entries are added, it also generates 
> > notification
> > events.
> > 
> 
> Could you please show fdb event with multiple ports ?
We will get to that. Maybe this is an argument for converting to mdb. We have
not looked into the details of this yet.

> >>> It seems that it can be added to the existing implementation with out 
> >>> adding
> >>> significant complexity.
> >>>
> >>> It will be easy to offload in HW.
> >>>
> >>> I do not believe that it will be a performance issue, if this is a 
> >>> concern then
> >>> we may have to do a bit of benchmarking, or we can make it a configuration
> >>> option.
> >>>
> >>> Long story short, we (Horatiu and I) learned a lot from the discussion 
> >>> here, and
> >>> I think we should try do a new patch with the learning we got. Then it is 
> >>> easier
> >>> to see what it actually means to the exiting code, complexity, exiting 
> >>> drivers,
> >>> performance, default behavioral, backwards compatibly, and other valid 
> >>> concerns.
> >>>
> >>> If the patch is no good, and cannot be fixed, then we will go back and 
> >>> look
> >>> further into alternative solutions.
> >> Overall, I tend to agree with Nik. I think your use case is too specific
> >> to justify the amount of changes you want to make in the bridge driver.
> >> We also provided other alternatives. That being said, you're more than
> >> welcome to send the patches and we can continue the discussion then.
> > Okay, good to know. I'm not sure I agree that the alternative solutions 
> > really
> > solves the issue this is trying to solve, nor do I agree that this is 
> > specific
> > to our needs.
> > 
> > But lets take a look at a new patch, and see what is the amount of changes 
> > we
> > are talking about. Without having the patch it is really hard to know for 
> > sure.
> Please keep in mind that this case is the exception, not the norm, thus it 
> should
> not under any circumstance affect the standard deployments.
Understood - no surprises.

-- 
/Allan

Reply via email to