On 08/09/2025 16:09, Ilya Maximets wrote:
> On 9/4/25 1:57 PM, Yael Chemla wrote:
>> On 01/09/2025 22:15, Ilya Maximets wrote:
>>> On 8/26/25 2:31 PM, Yael Chemla wrote:
>>>> On 25/07/2025 22:02, Ilya Maximets wrote:
>>>>> Hi, Yael.  Thanks for the patch.
>>>>>
>>>>> On 7/23/25 10:13 PM, Yael Chemla wrote:
>>>>>> Commit aee4f9aec ("netdev-offload-tc: Match against tunnel flags
>>>>>> if supported") added support to match against 'csum' and 'df' tunnel
>>>>>> flags in TC, assuming they’re only set when actually needed. But
>>>>>> the code in tnl_wc_init() always sets these flags by default,
>>>>>> without checking if HW offload is enabled or if the flow explicitly
>>>>>> requests them. This breaks HW offload, since mlx5 (and other NICs)
>>>>>> do not support offloading when these flags are set.
>>>>>>
>>>>>> To work around this, users are currently have to add the following
>>>>>> options to their flows "options:df_default=false options:csum=false".
>>>>>> This patch helps by adjusting the wildcard mask according to actual
>>>>>> tunnel config. If 'csum' or 'df' are disabled, the mask will no
>>>>>> longer include them.
>>>>>
>>>>> These are the fields of the packet header and they nave nothing to do
>>>>> with the configuration of the tunnel port on the destination.  Instead,
>>>>> they depend on the tunnel configuration on the traffic source.  So this
>>>>> doesn't make a lot of sense to match or not match based on the
>>>>> configuration of the destination port.
>>>>>
>>>>
>>>> Hi Ilya, thank you for the clarifications.
>>>> It seems we misunderstood the semantics of these options.
>>>>>>
>>>>>> Note that the decision to match these flags depends on
>>>>>> `enc_flags_support`, which is probed once via
>>>>>> `probe_enc_flags_support()`. In most cases, this probe is performed
>>>>>> against the ovs-system device with `tc_policy=SKIP_HW`, which causes
>>>>>> the probe to succeed regardless of whether the underlying hardware
>>>>>> can actually offload these flags. Therefore, relying solely on
>>>>>> `enc_flags_support` is insufficient and can result in incorrect
>>>>>> behavior. This patch works around the issue by ensuring that even
>>>>>> if support is globally enabled, the wildcard mask reflects the actual
>>>>>> per-port tunnel configuration to avoid unnecessary matches that would
>>>>>> break hardware offload.
>>>>>
>>>>> Again, this match represents the actual packet header that depends on
>>>>> the configuration of the tunnel port on the traffic source.
>>>>>
>>>>> Can this be resolved on a kernel level, i.e. in the driver?  These are
>>>>> just packet header fields, so it should be possible for the NIC to set
>>>>> them.  E.g. the csum bit should be set if the udphdr->check != 0.
>>>>>
>>>>
>>>> In the case of the mlx5 driver, when hardware offload is enabled the
>>>> hardware cannot match on DF or CSUM. From your perspective, would it be
>>>> acceptable in such a case to skip matching those bits?
>>>
>>> I don't think so.  We can change OVS to avoid matching on these flags
>>> when it is not necessary, but datapath implementation should not ignore
>>> things in general, especially since matching on these bits is exposed
>>> not only through OVS but is also available for users setting up TC
>>> flower directly.
>>
>> We can choose not to match on DF/CSUM if using TC flower directly, while
>> OVS doesn't provide that flexibility. The hw-offload is broken after
>> kernel commit aee4f9aec ("netdev-offload-tc: Match against tunnel flags
>> if supported").>
> 
> Sure, as I said before, we can also change OVS to avoid matches on these
> flags when they are not necessary.  But, again, the datapath itself should
> not ignore the bits, as it can't, in general, make that kind of decisions.
> 

The datapath does not "ignore" these bits, in our setup the mlx5 path
rejects offload attempts that include DF/CSUM matches because the
driver/hardware does not support matching those bits. To avoid causing
offload failures we propose that OVS should not add DF/CSUM into the
wildcard mask unless the classifier explicitly requests them.

>>> Is it a limitation in the hardware?  Seems strange that it can match
>>> on the key flag, but can't match on df bit.  Or is the key flag also
>>> mostly just ignored?
>>
>> Yes, this is indeed a hardware limitation. From the kernel patches, it
>> seems to be a limitation across all mainstream NICs:
>> 5a1b015d521d ice: flower: validate encapsulation control flags
>> 34cdd9847820 nfp: flower: validate encapsulation control flags
>> 28d19ec91755 net/mlx5e: flower: validate encapsulation control flags
>> 2ede54f8786f sfc: use flow_rule_is_supp_enc_control_flags()
>> b48a1540b73a flow_offload: add encapsulation control flag helpers
> 
> These commits were added as a stopgap to avoid falsely accept flows offload
> for which is not implemented in the driver.  I highly doubt that hardware
> capabilities were consulted before making the change.  The thing is that
> driver needs to have an explicit support and they do not today.  It doesn't
> mean the hardware can't extract this information.
> 
>> Which key flag are you referring to? Did you mean key:option?
> 
> I meant the IP_TUNNEL_KEY_BIT that signifies existence of the tunnel ID
> (which is a different thing for different tunnels, e.g. always present for
> VXLAN, but is optional for GRE).
> 

Just to clarify, in the mlx5 driver, the IP_TUNNEL_KEY_BIT is taken care
of. On receive, the decap logic pulls out the tunnel key/VNI from the
inner packet, sets the tun_id, and marks IP_TUNNEL_KEY_BIT.
On transmit, the encapsulation code reads that tun_id and writes it into
the right protocol field (for VXLAN, that means mapping the tun_id to
the VNI, vxlan_vni). This is a separate capability from DF/CSUM matching.

>>>>> From OVS perspective, kernel reports that these fields are supported in
>>>>> TC, so we use them, because that's the correct thing to do.  There is
>>>>> no real way to probe the actual card, especially for tunnel ports, as
>>>>> offloading is happening through the bridge port instead.
>>>>
>>>> Currently, probe_enc_flags_support() does not consider whether hardware
>>>> offload is enabled. We have another patch for netdev-offload-tc.c that
>>>> modifies this probing logic to explicitly test whether TC can match on
>>>> DF and CSUM.
>>>
>>> How are you probing this?  I guess it's not by trying to install flows
>>> with skip_sw, as you mentioned above that it's not enough, but I'm not
>>> sure how the probing can be done otherwise.
>>
>> Yes, it seems no other way but installing flow with skip_sw to probe.
>> I’ll send the patch shortly. In essence, it tries a tc_replace_flower
>> with the CSUM and DF tunnel flags on tunnel devices, and updates the
>> enc_flags_support accordingly based on the result.
> 
> I'm still not sure how that supposed to work when the tunnel flows are
> installed on the bridge ingress qdisc.
> 

Our next patch uses a two-step probe to validate hardware support for
ENC_FLAGS. (1) A global skip_hw probe checks whether the kernel
recognizes TCA_FLOWER_KEY_ENC_FLAGS. (2) Per device, if it’s a tunnel
and hw offload is enabled, we install a temporary flower rule and read
it back with tc_get_flower to confirm offloaded_state == IN_HW. We cache
the result per device. This avoids false positives where the global
skip_hw probe succeeds but the device supports the key only in software.

>>>>> There is an argument that can be made that OVS doesn't really need to
>>>>> match on these two bits, but it's a different argument with a different
>>>>> implementation.  And we may need to match on these bits in the future.
>>>>>
>>>> Could you please point me to that discussion? I’d like to review the
>>>> correspondence to better understand the context.
>>>
>>> "can be made", i.e. it was not previously discussed, so someone needs
>>> to make that argument.  The base idea would be that these flags are not
>>> exposed through OpenFlow today and OVS doesn't make a lot of decisions
>>> based on these two flags.  At the same time the argument will fall apart
>>> as soon some of these conditions are no longer true, in which case we
>>> still need the datapath to behave correctly, i.e. reject the flows that
>>> it can't properly support.
>>
>> Thanks for clarifying. Understood that the base idea is these flags are
>> not currently exposed via OpenFlow, and OVS doesn’t rely heavily on them
>> today. But as you point out, this may change in the future, and we’ll
>> still need the datapath to reject flows it cannot properly support.>
> 
> This can be properly implemented if we not include the bits in the mask
> unless they are set by the classifier.  This way if users will explicitly
> ask for a match on these bits through OpenFlow, such flows will not be
> offloaded, but we have many other non-offloadable matches and actions
> today, so it should be fine.  At the same time, if the driver can be
> updated to properly match on the bits, that would be better.
> 
I agree that not including DF/CSUM in the mask unless the classifier
requests them is the correct short-term fix and will address the offload
breakages introduced after the mentioned OVS commit. The mlx5 driver
currently rejects offloads that require matching those bits, so changing
OVS mask behavior prevents the failure mode.

>>>>> Beside the patch being corrupted, as Aaron mentioned, it is also missing
>>>>> some unit tests for the change.
To be sure I understand, do you mean I should add a new test in the
tests/ directory (so it runs as part of make check), or did you have a
specific kind of test in mind? If it’s the latter, could you please give
me a hint or an example of what would be most useful here?

>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>
>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to