On 5/1/25 12:02 AM, Han Zhou wrote:
> Hi,
> 
> We encountered a problem where bursts of packets miss the megaflow cache and 
> go to the slowpath when these packets have the ECN CE (Congestion Experienced) 
> bit set to 1. The purpose of the ECN CE bit is to signal congestion 
> experienced by a router/switch, in the hope that the receiving application 
> reacts by notifying the sender to slow down. However, the CE bit change in the 
> header actually results in slowpath handling on the receiving side, which may 
> make the congestion worse for a short time because it slows down 
> receive-side handling of the burst of CE-marked packets.
> 
> The ECN CE bit change leads to a megaflow cache miss because any traffic 
> that has a non-zero TOS field (including DSCP and ECN) in the tunnel header 
> triggers generation of a megaflow with an exact match on the TOS field of 
> the tunnel header, for example:
> 
> skb_priority(0/0),tunnel(tun_id=0x231,src=10.2.0.152,dst=10.2.64.11,tos=0x8a,...,ipv4(src=10.194.120.8,dst=10.194.168.251,proto=17,tos=0/0,ttl=0/0,frag=no),...,
>  actions:73
> 
> Here, for tos=0x8a, the last 2 bits are ECN: 10 means ECN-Capable Transport (ECT).
> So far, all packets of the flow hit the megaflow cache. When congestion is 
> detected by the uplink router, the router sets the last bit to 1, so the 
> packets no longer match the existing megaflow. They go through the slowpath, 
> which generates a new megaflow:
> 
> skb_priority(0/0),tunnel(tun_id=0x231,src=10.2.0.152,dst=10.2.64.11,tos=0x8b,...,ipv4(src=10.194.120.8,dst=10.194.168.251,proto=17,tos=0x3/0x3,ttl=0/0,frag=no),...,
>  actions:73
> 
> Here, for tos=0x8b, the last 2 bits are ECN: 11 means Congestion 
> Experienced (CE).
> In our case the packet rate is very high during the congestion, and by the 
> time the new megaflow is generated, thousands of packets have already gone 
> to the slowpath.
> 
> The root cause of the issue is that OVS handles the ECN bits implicitly (not 
> configurable/controllable by OpenFlow rules). It seems to unwildcard those 
> bits in the match just to detect the CE bit change so that it can replicate 
> the CE bit to the inner header. From tcpdump on both the tunnel interface and 
> the VM interface, we confirmed that OVS indeed sets the CE bit of the 
> inner IP header according to the CE bit of the tunnel outer IP header. We 
> believe this is desired behavior, because otherwise congestion control 
> signalling won't work when the router only sees and updates the outer header, 
> while the application/protocol layer that handles the congestion control bit 
> is on the overlay logical network, which can only see the inner header.
> 
> I didn't find any detailed documentation for this behavior, except that in 
> https://www.openvswitch.org/support/dist-docs/ovs-vswitchd.conf.db.5.txt it 
> is briefly mentioned as a side note in the description of options:tos for 
> tunnel configuration.
> 
> The code that updates the inner header ECN is: 
> https://github.com/openvswitch/ovs/blob/main/ofproto/tunnel.c#L357
> 
> We can also see that the code initializes the tunnel TOS field mask to all 
> 1s (unwildcarded):
> https://github.com/openvswitch/ovs/blob/main/ofproto/tunnel.c#L378
> 
> We did see megaflows with TOS field wildcarded (0/0) for traffic that has TOS 
> value 0, but didn't find the corresponding code that wildcarded it.
> 
> We believe the fastpath implementation (e.g. the OVS kernel module) must 
> follow the same implicit logic, because, as we can see, the action part 
> of the megaflow doesn't have any actions that modify the CE bit.
> 
> So now the question is, how to avoid the exact match for those bits in the 
> megaflow, while still being able to satisfy the requirement of ECN handling 
> for tunneled packets. It would be good to see a more detailed explanation of 
> the behavior and its original requirement. While I didn't find any such 
> details other than the above pieces of documentation and code, I guess the 
> requirement is just to replicate the CE bit from the outer header to the 
> inner header, i.e. set the inner CE bit to 1 if the CE bit in the outer 
> header is 1. If that's the case, then we wonder if we could just always 
> wildcard these bits and always do the same implicit handling for tunneled 
> packets, which would solve the megaflow cache miss while still satisfying the 
> ECN handling requirement? Or, is there any other reason we do exact match for 
> these bits?

Hi, Han.

That's an interesting issue.  The documentation you're looking for is RFC 6040,
which describes how the ECN bits should be propagated during the encap/decap
process.  Specifically:
  https://www.rfc-editor.org/rfc/rfc6040#page-10
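
For reference, the decapsulation rules in that part of the RFC can be
sketched roughly as below.  This is only an illustration of the RFC 6040
table for combining the outer and inner ECN fields, not code taken from OVS
or the kernel; the function and constant names are made up for the example.

```python
# ECN codepoints (RFC 3168): the two low-order bits of the TOS byte.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11
ECN_MASK = 0x03


def decap_ecn(outer_tos, inner_tos):
    """Hypothetical helper: return the inner TOS byte to forward after
    decapsulation, or None if the packet should be dropped."""
    outer = outer_tos & ECN_MASK
    inner = inner_tos & ECN_MASK

    if inner == NOT_ECT:
        # The inner flow is not ECN-capable, so a CE mark on the outer
        # header cannot be propagated; RFC 6040 says to drop the packet.
        if outer == CE:
            return None
        new_ecn = NOT_ECT
    elif outer == CE:
        # Congestion seen on the tunnel path is copied to the inner header.
        new_ecn = CE
    elif outer == ECT1 and inner == ECT0:
        # The one remaining case where the outer codepoint wins.
        new_ecn = ECT1
    else:
        # Otherwise the inner codepoint is kept unchanged.
        new_ecn = inner

    # Keep the inner DSCP bits, replace only the ECN bits.
    return (inner_tos & 0xFC) | new_ecn
```

With the values from the report, an outer tos=0x8b (CE) and an ECT(0) inner
header give an inner ECN of CE, while an outer tos=0x8a leaves the inner
header as it was.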

Tunnel implementations in the Linux kernel seem to support the same logic from
the RFC, so it might be possible to avoid exact matches on the ECN bits.
However, there is no real way for the ofproto layer to detect whether this is
implemented in the datapath or not.  The userspace datapath doesn't implement
this, the Windows datapath doesn't either, and we use raw encap for rte_flow,
so that will also likely not support it.  I'm not sure about TC offload: I saw
some references to ECN in the mlx driver, but I'm not confident these are
relevant.  Besides, implementing support for this in the userspace datapath may
be problematic from the performance point of view, as we'd need to add extra
parsing logic to the datapath and modify headers on a per-packet basis.  Though
it's hard to tell what the performance impact would be.

So, if we find a good way of detecting the datapath support, then it might be a
good improvement to avoid flow-based ECN handling when datapath supports it.
But it may not be worth implementing this handling in all the datapaths.
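
To make the trade-off concrete, here is a simplified sketch of masked
matching on the tunnel TOS byte, using the tos values from the report.  It
is only a model: real megaflow keys cover many more fields, and the helper
names and the 0xfc mask are assumptions for illustration, not something OVS
installs today.

```python
ECN_MASK = 0x03  # two low-order bits of the TOS byte


def split_tos(tos):
    """Hypothetical helper: split a TOS byte into (DSCP, ECN)."""
    return tos >> 2, tos & ECN_MASK


def megaflow_hit(pkt_tos, flow_tos, flow_mask):
    """Simplified masked match on one field: only bits set in the mask
    are compared between the packet and the cached flow."""
    return (pkt_tos & flow_mask) == (flow_tos & flow_mask)


# Exact match on the tunnel TOS (mask 0xff), as in the reported megaflow:
# ECT(0) packets hit, CE-marked packets of the same flow miss.
exact_ect_hit = megaflow_hit(0x8A, 0x8A, 0xFF)   # True
exact_ce_hit = megaflow_hit(0x8B, 0x8A, 0xFF)    # False

# With the ECN bits wildcarded (mask 0xfc), both hit the same megaflow.
# This is only safe if the datapath itself propagates the CE mark.
wild_ect_hit = megaflow_hit(0x8A, 0x8A, 0xFC)    # True
wild_ce_hit = megaflow_hit(0x8B, 0x8A, 0xFC)     # True
```

Both 0x8a and 0x8b carry the same DSCP (34) and differ only in the ECN bits,
which is why wildcarding just those two bits would be enough to keep the
CE-marked burst on the fast path.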

Best regards, Ilya Maximets.

> 
> Any information or suggestions are highly appreciated.
> 
> Best regards,
> Han

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev