On 2 Jul 2024, at 15:38, Ilya Maximets wrote:

> On 6/27/24 15:03, Eelco Chaudron wrote:
>>
>>
>> On 4 Jun 2024, at 13:52, Ilya Maximets wrote:
>>
>>> On 6/4/24 13:42, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 1 Jun 2024, at 0:08, Ilya Maximets wrote:
>>>>
>>>>> On 5/7/24 15:52, Eelco Chaudron wrote:
>>>>>> While offloading header modifications to TC, OVS is using {TCA_PEDIT} +
>>>>>> {TCA_CSUM} combination as that it the only way to represent header
>>>>>> rewrite.  However, {TCA_CSUM} is unable to calculate L4 checksums for
>>>>>> IP fragments.
>>>>>>
>>>>>> Since TC already applies fragmentation bit masking, this patch simply
>>>>>> needs to prevent these packets from being processed through TC.
>>>>>>
>>>>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>>>>> ---
>>>>>> v2: - Fixed and added some comments.
>>>>>>     - Use ovs-pcap to compare packets.
>>>>>>
>>>>>> NOTE: This patch needs an AVX512 fix before it can be applied.
>>>>>>       Intel is working on this.
>>>>>> ---
>>>>>>  lib/netdev-offload-tc.c | 32 ++++++++++++++
>>>>>>  lib/tc.c                |  5 ++-
>>>>>>  tests/system-traffic.at | 93 +++++++++++++++++++++++++++++++++++++++++
>>>>>>  3 files changed, 129 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
>>>>>> index 921d52317..bdd307933 100644
>>>>>> --- a/lib/netdev-offload-tc.c
>>>>>> +++ b/lib/netdev-offload-tc.c
>>>>>> @@ -1488,6 +1488,30 @@ parse_put_flow_ct_action(struct tc_flower *flower,
>>>>>>          return 0;
>>>>>>  }
>>>>>>
>>>>>> +static bool
>>>>>> +will_tc_add_l4_checksum(struct tc_flower *flower, int type)
>>>>>> +{
>>>>>> +    /* This function returns true if the tc layer will add a l4 
>>>>>> checksum action
>>>>>> +     * for this set action.  Refer to the csum_update_flag() function 
>>>>>> for
>>>>>> +     * detailed logic.  Note that even the kernel only supports 
>>>>>> updating TCP,
>>>>>> +     * UDP and ICMPv6. */
>>>>>
>>>>> This comment should be outside of the function, I think.  It's strange
>>>>> to have it here.  I see csum_update_flag() has a comment inside, but
>>>>> that's strange as well.  That function has other style issues as well,
>>>>> there is no need to copy them.
>>>>
>>>> ACK, will clean this up in the next rev.
>>>>
>>>>>> +    switch (type) {
>>>>>> +    case OVS_KEY_ATTR_IPV4:
>>>>>> +    case OVS_KEY_ATTR_IPV6:
>>>>>> +    case OVS_KEY_ATTR_TCP:
>>>>>> +    case OVS_KEY_ATTR_UDP:
>>>>>> +        switch (flower->key.ip_proto) {
>>>>>> +        case IPPROTO_TCP:
>>>>>> +        case IPPROTO_UDP:
>>>>>> +        case IPPROTO_ICMPV6:
>>>>>> +        case IPPROTO_UDPLITE:
>>>>>> +            return true;
>>>>>> +        }
>>>>>> +        break;
>>>>>> +    }
>>>>>> +    return false;
>>>>>> +}
>>>>>> +
>>>>>>  static int
>>>>>>  parse_put_flow_set_masked_action(struct tc_flower *flower,
>>>>>>                                   struct tc_action *action,
>>>>>> @@ -1520,6 +1544,14 @@ parse_put_flow_set_masked_action(struct tc_flower 
>>>>>> *flower,
>>>>>>          return EOPNOTSUPP;
>>>>>>      }
>>>>>>
>>>>>> +    if (flower->key.flags & TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT
>>>>>
>>>>> We're using this field to make an offloading decision.  We must
>>>>> also set in the mask.  If for some reason we're not matching on
>>>>> the fragment bits, we may first receive a non-fragmented packet
>>>>> and offload it, then fragmented traffic may match and fail the
>>>>> checksumming.  So, we need to enable the bit in the mask.
>>>>
>>>> The dp always matches on the fragment bit for IPv4 packets, I did some 
>>>> tests with this.
>>>> So if we sent a non-fragment packet first the dp rule will match on 
>>>> fragment = 0. Or
>>>> did I miss something?
>>>
>>> It is true today, but nothing ensures that on the netdev-offload-tc level.
>>> Moreover, the netdev-offload-tc is written in a way that it expects the
>>> frag bits to potentially not be in the mask:
>>> https://github.com/openvswitch/ovs/blob/1d681ffe3b208a0db4945b6389142ab18404a4d1/lib/netdev-offload-tc.c#L2432-L2448
>>> So, it is going to be internally inconsistent if we do not set the bit
>>> explicitly.
>>>
>>> And if someday we'll stop adding the frag bit, we'll never know if we
>>> forget to set it in netdev-offload-tc.  At the very least we'll need an
>>> assertion that it is set.  But having an assertion will still be internally
>>> inconsistent with the code linked above.  So, it may be better to just fix
>>> it instead anyway.
>>
>> I finally got some time to look at the code, and I can add the below abort()
>> in case OVS decides to no longer mask out the frag bits by default.
>>
>> However, I have no clear idea how to make this work if we ever stop adding
>> this frag mask. I don't think we can report back to request a wider mask or
>> track what exists and force a revalidate.
>>
>> @@ -2447,6 +2479,8 @@ netdev_tc_flow_put(struct netdev *netdev, struct match 
>> *match,
>>              }
>>
>>              mask->nw_frag = 0;
>> +        } else {
>> +            ovs_abort(0, "TC offload assumes nw_frag is always masked");
>>          }
>
> We don't need to abort.  We should just narrow down the match by adding
> frag bits to the mask here.  Datapath implementations are allowed to
> narrow down the match, they are not allowed to make them wider.

I found a corner case where we do not set the frag mask correctly (it might be 
a bug). It’s testcase 'datapath - ping between two ports on cvlan’, which 
creates the following DP rule:

  
recirc_id(0),in_port(ovs-p0),eth(src=6e:48:c8:77:d3:8c,dst=12:f6:8b:52:f9:75),eth_type(0x88a8),vlan(vid=4094),encap(eth_type(0x0000),vlan(vid=100,pcp=0),encap(eth_type(0x0800)))

I tried setting the mask, mask->nw_frag = FLOW_NW_FRAG_MASK, but it will not 
add the DP rule with the frag match.

I’ll try to dig into this a bit further as it might be a bug with the cvlan 
code.


>>>>>> +        && will_tc_add_l4_checksum(flower, type)) {
>>>>>> +        VLOG_DBG_RL(&rl, "set action type %d not supported on fragments 
>>>>>> "
>>>>>> +                    "due to checksum limitation", type);
>>>>>> +        ofpbuf_uninit(&set_buf);
>>>>>> +        return EOPNOTSUPP;
>>>>>> +    }
>>>>>> +
>>>>>>      for (i = 0; i < ARRAY_SIZE(set_flower_map[type]); i++) {
>>>>>>          struct netlink_field *f = &set_flower_map[type][i];
>>>>>>
>>>>>> diff --git a/lib/tc.c b/lib/tc.c
>>>>>> index e9bcae4e4..20472d13c 100644
>>>>>> --- a/lib/tc.c
>>>>>> +++ b/lib/tc.c
>>>>>> @@ -2958,7 +2958,10 @@ csum_update_flag(struct tc_flower *flower,
>>>>>>       * eth(dst=<mac>),eth_type(0x0800) actions=set(ipv4(src=<new_ip>))
>>>>>>       * we need to force a more specific flow as this can, for example,
>>>>>>       * need a recalculation of icmp checksum if the packet that passes
>>>>>> -     * is ICMPv6 and tcp checksum if its tcp. */
>>>>>> +     * is ICMPv6 and tcp checksum if its tcp.
>>>>>> +     *
>>>>>> +     * Ensure that you keep the pre-check function in 
>>>>>> netdev-offload-tc.c,
>>>>>> +     * will_tc_add_l4_checksum(), in sync if you make any changes here. 
>>>>>> */
>>>>>>
>>>>>>      switch (htype) {
>>>>>>      case TCA_PEDIT_KEY_EX_HDR_TYPE_IP4:
>>>>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>>>>> index bd7647cbe..8f392abd5 100644
>>>>>> --- a/tests/system-traffic.at
>>>>>> +++ b/tests/system-traffic.at
>>>>>> @@ -8977,3 +8977,96 @@ OVS_WAIT_UNTIL([cat p2.pcap | grep -E "0x0050: 
>>>>>> *0000 *0000 *5002 *2000 *b85e *00
>>>>>>
>>>>>>  OVS_TRAFFIC_VSWITCHD_STOP
>>>>>>  AT_CLEANUP
>>>>>> +
>>>>>> +AT_SETUP([datapath - mod_nw_src/set_field on IP fragments])
>>>>>> +AT_SKIP_IF([test $HAVE_TCPDUMP = no])
>>>>>> +OVS_TRAFFIC_VSWITCHD_START()
>>>>>> +
>>>>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>>>>> +
>>>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24", 36:b1:ee:7c:01:03)
>>>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24", 36:b1:ee:7c:01:02)
>>>>>> +
>>>>>> +AT_DATA([flows.txt], [dnl
>>>>>> +  in_port=ovs-p0,ip,nw_src=10.1.1.1 actions=mod_nw_src=11.1.1.1,ovs-p1
>>>>>> +  in_port=ovs-p0,ipv6,ipv6_src=fc00::1 
>>>>>> actions=set_field:fc00::100->ipv6_src,ovs-p1
>>>>>> +])
>>>>>> +
>>>>>> +AT_CHECK([ovs-ofctl del-flows br0])
>>>>>> +AT_CHECK([ovs-ofctl -Oopenflow13 add-flows br0 flows.txt])
>>>>>> +
>>>>>> +NETNS_DAEMONIZE([at_ns1],
>>>>>> +                [tcpdump -l -nn -xx -U -i p1 -w p1.pcap 2> tcpdump.err],
>>>>>> +                [tcpdump.pid])
>>>>>> +OVS_WAIT_UNTIL([grep "listening" tcpdump.err])
>>>>>> +
>>>>>> +dnl IPv4 Packet content:
>>>>>> +dnl   Ethernet II, Src: 36:b1:ee:7c:01:03, Dst: 36:b1:ee:7c:01:02
>>>>>> +dnl       Type: IPv4 (0x0800)
>>>>>> +dnl   Internet Protocol Version 4, Src: 10.1.1.1, Dst: 10.1.1.2
>>>>>> +dnl       0100 .... = Version: 4
>>>>>> +dnl       .... 0101 = Header Length: 20 bytes (5)
>>>>>> +dnl       Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
>>>>>> +dnl       Total Length: 38
>>>>>> +dnl       Identification: 0x0001 (1)
>>>>>> +dnl       001. .... = Flags: 0x1, More fragments
>>>>>> +dnl           0... .... = Reserved bit: Not set
>>>>>> +dnl           .0.. .... = Don't fragment: Not set
>>>>>> +dnl           ..1. .... = More fragments: Set
>>>>>> +dnl       ...0 0000 0000 0000 = Fragment Offset: 0
>>>>>> +dnl       Time to Live: 64
>>>>>> +dnl       Protocol: UDP (17)
>>>>>> +dnl       Header Checksum: 0x44c2
>>>>>> +dnl   Data (18 bytes)
>>>>>> +eth="36 b1 ee 7c 01 02 36 b1 ee 7c 01 03 08 00"
>>>>>> +ip="45 00 00 26 00 01 20 00 40 11 44 c2 0a 01 01 01 0a 01 01 02"
>>>>>> +data="0b c4 08 84 00 26 e9 64 01 02 03 04 05 06 07 08 09 0a"
>>>>>> +packet="${eth} ${ip} ${data}"
>>>>>
>>>>> Since you're backporting the compose-packet functionality now, it's better
>>>>> if we use it here instead.  We may want to get my sendpkt patch first:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>>
>>>>> And then follow a pattern from the other patches from the same set:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>>
>>>>> What do you think?
>>
>> I tried this, but no luck, as I cannot control the checksum needed on the UDP
>> side. It must include some missing data (i.e., the data in the next 
>> fragment).
>
> It doesn't have to.  compose-packet will just set the fragment bit,
> i.e. it'll be a full packet, but marked as a first fragment.  It's
> technically allowed, IIUC.  TC will bail on checksum re-calculation,
> while OVS kernel will update the existing one.
>
> And we don't need to guess the checksum, we can re-generate the packet
> after updating the fields.

The problem is that it will not catch the bug this way. TC will do a full 
re-calculation, which will result in a valid checksum for UDP.

>> Also, the code lacks IPv6 fragment header support, and adding it fully might
>> not be trivial.
>
> I missed that.  We're indeed missing support for generation of IPv6
> fragments.
>
>>
>> So, my suggestion is to keep it as is for now.
>>
>> Thoughts?
>
> I guess, it's fine to keep as-is for now, since we can't generate
> IPv6 fragments correctly.
>
> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to