Hi Ben,
Let me try to explain the motivation of this patch. It actually fixes a severe
design flaw in the native tunnel handling of OVS in the context of
SDN-controlled cloud systems, which we found and had to fix downstream in our
NFVI solution.
To implement tunnels (e.g. VXLAN, GRE) in the userspace datapath, OVS needs to
have the MAC address of the IP nexthop on the underlay network (either the
remote tunnel end-point in the case of an L2 underlay or a gateway in case of a
L3 underlay). To this end OVS snoops passing ARP reply messages to build up an
internal ARP cache. (I don't mention IPv6 ND in this mail, but the equivalent
holds there as well).
The current ARP snooping logic is broken because OVS simply snoops any ARP
packet in any bridge after each flow/group table lookup:
static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
struct xlate_ctx *ctx, bool is_last_action)
{
struct flow_wildcards *wc = ctx->wc;
struct flow *flow = &ctx->xin->flow;
const struct ofpact *a;
/* FIXME: This unconditional snooping in all bridges and every flow
translation is bad. */
if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
tnl_neigh_snoop(flow, wc, ctx->xbridge->name);
}
This happens even in overlay bridges carrying the tenant network traffic. What
is worse is that OVS stores all IP-MAC address pairs in a single global per
bridge, not considering VLANs or any other SDN-controlled network separation.
As IP addresses are generally not unique across tenant networks in NFVI clouds,
there are frequent collisions and these useless ARP cache entries are
constantly overwritten.
Every update in the ARP table is considered a switch configuration change and
triggers the revalidation of all existing datapath flow cache entries, an
expensive operation. The result is constant high ovs-vswitchd CPU load and even
data plane disturbances. In the worst case the correct ARP entry for the tunnel
nexthop can be overwritten, resulting in lost tunnel packets.
This patch addresses the problem by:
* Moving the tnl_neigh_snoop() call to the only place where ARP snooping makes
sense: at output to the bridge's LOCAL port, as this is the only supported port
for tunnel endpoints today.
* Checking that the ARP reply is destined to one of the known IP addresses
assigned to the bridge's LOCAL port before snooping.
* As the actual tunnel endpoint is typically a LOCAL port of a NORMAL (non-SDN
controlled) underlay bridge (e.g. br-prv), the VLAN filtering is implicitly
done by the NORMAL action, so that only packets in the LOCAL port's VLAN are
seen by the output function.
* On the SDN-controlled bridges that handle tenant networking ("br-int"), the
LOCAL port does not participate in traffic. No tenant packets will be seen by
the tunnel neighbor snooping, so that frequent collisions are totally avoided.
I hope that clarifies the big picture. I leave the more technical aspects of
the data structures to Zoltan.
@Zoltan: Can you include this explanation into the commit message make sure the
code is commented as well in the critical places?
Regards, Jan
> I don't understand yet the motivation here. What does it mean "to keep
> ARP neighbor cache clean"? Is this a bug fix, a performance fix, a
> cleanup, or something else?
>
> What is the goal of the filtering you mention? Is that a second change,
> that can and should be a separate patch, or is it closely coupled to the
> first change you mention?
>
> This gives me the following Clang errors:
>
> ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t
> *' (aka 'const unsigned char *') to 'const struct in6_addr *'
> increases required alignment from 1 to 4 [-Werror,-Wcast-align]
> /usr/include/netinet/in.h:454:36: note: expanded from macro
> 'IN6_ARE_ADDR_EQUAL'
> ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t
> *' (aka 'const unsigned char *') to 'const struct in6_addr *'
> increases required alignment from 1 to 4 [-Werror,-Wcast-align]
> /usr/include/netinet/in.h:455:36: note: expanded from macro
> 'IN6_ARE_ADDR_EQUAL'
>
> The change to tnl-neigh-cache.c that just adds an #include directive
> seems odd to me.
>
> The treatment of the new xbridge_addr is different from the treatment of
> everything already handled in xlate_ofproto_set(). Why does it need
> special treatment? Is it RCU safe? Why does it need a refcount (none
> of the other structs do)?
>
> Thanks,
>
> Ben.
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev