On 2 Jul 2024, at 15:38, Ilya Maximets wrote:
> On 6/27/24 15:03, Eelco Chaudron wrote:
>>
>>
>> On 4 Jun 2024, at 13:52, Ilya Maximets wrote:
>>
>>> On 6/4/24 13:42, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 1 Jun 2024, at 0:08, Ilya Maximets wrote:
>>>>
>>>>> On 5/7/24 15:52, Eelco Chaudron wrote:
>>>>>> While offloading header modifications to TC, OVS is using {TCA_PEDIT} +
>>>>>> {TCA_CSUM} combination as that it the only way to represent header
>>>>>> rewrite. However, {TCA_CSUM} is unable to calculate L4 checksums for
>>>>>> IP fragments.
>>>>>>
>>>>>> Since TC already applies fragmentation bit masking, this patch simply
>>>>>> needs to prevent these packets from being processed through TC.
>>>>>>
>>>>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>>>>> ---
>>>>>> v2: - Fixed and added some comments.
>>>>>> - Use ovs-pcap to compare packets.
>>>>>>
>>>>>> NOTE: This patch needs an AVX512 fix before it can be applied.
>>>>>> Intel is working on this.
>>>>>> ---
>>>>>> lib/netdev-offload-tc.c | 32 ++++++++++++++
>>>>>> lib/tc.c | 5 ++-
>>>>>> tests/system-traffic.at | 93 +++++++++++++++++++++++++++++++++++++++++
>>>>>> 3 files changed, 129 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
>>>>>> index 921d52317..bdd307933 100644
>>>>>> --- a/lib/netdev-offload-tc.c
>>>>>> +++ b/lib/netdev-offload-tc.c
>>>>>> @@ -1488,6 +1488,30 @@ parse_put_flow_ct_action(struct tc_flower *flower,
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> +static bool
>>>>>> +will_tc_add_l4_checksum(struct tc_flower *flower, int type)
>>>>>> +{
>>>>>> + /* This function returns true if the tc layer will add a l4
>>>>>> checksum action
>>>>>> + * for this set action. Refer to the csum_update_flag() function
>>>>>> for
>>>>>> + * detailed logic. Note that even the kernel only supports
>>>>>> updating TCP,
>>>>>> + * UDP and ICMPv6. */
>>>>>
>>>>> This comment should be outside of the function, I think. It's strange
>>>>> to have it here. I see csum_update_flag() has a comment inside, but
>>>>> that's strange as well. That function has other style issues as well,
>>>>> there is no need to copy them.
>>>>
>>>> ACK, will clean this up in the next rev.
>>>>
>>>>>> + switch (type) {
>>>>>> + case OVS_KEY_ATTR_IPV4:
>>>>>> + case OVS_KEY_ATTR_IPV6:
>>>>>> + case OVS_KEY_ATTR_TCP:
>>>>>> + case OVS_KEY_ATTR_UDP:
>>>>>> + switch (flower->key.ip_proto) {
>>>>>> + case IPPROTO_TCP:
>>>>>> + case IPPROTO_UDP:
>>>>>> + case IPPROTO_ICMPV6:
>>>>>> + case IPPROTO_UDPLITE:
>>>>>> + return true;
>>>>>> + }
>>>>>> + break;
>>>>>> + }
>>>>>> + return false;
>>>>>> +}
>>>>>> +
>>>>>> static int
>>>>>> parse_put_flow_set_masked_action(struct tc_flower *flower,
>>>>>> struct tc_action *action,
>>>>>> @@ -1520,6 +1544,14 @@ parse_put_flow_set_masked_action(struct tc_flower
>>>>>> *flower,
>>>>>> return EOPNOTSUPP;
>>>>>> }
>>>>>>
>>>>>> + if (flower->key.flags & TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT
>>>>>
>>>>> We're using this field to make an offloading decision. We must
>>>>> also set in the mask. If for some reason we're not matching on
>>>>> the fragment bits, we may first receive a non-fragmented packet
>>>>> and offload it, then fragmented traffic may match and fail the
>>>>> checksumming. So, we need to enable the bit in the mask.
>>>>
>>>> The dp always matches on the fragment bit for IPv4 packets, I did some
>>>> tests with this.
>>>> So if we sent a non-fragment packet first the dp rule will match on
>>>> fragment = 0. Or
>>>> did I miss something?
>>>
>>> It is true today, but nothing ensures that on the netdev-offload-tc level.
>>> Moreover, the netdev-offload-tc is written in a way that it expects the
>>> frag bits to potentially not be in the mask:
>>> https://github.com/openvswitch/ovs/blob/1d681ffe3b208a0db4945b6389142ab18404a4d1/lib/netdev-offload-tc.c#L2432-L2448
>>> So, it is going to be internally inconsistent if we do not set the bit
>>> explicitly.
>>>
>>> And if someday we'll stop adding the frag bit, we'll never know if we
>>> forget to set it in netdev-offload-tc. At the very least we'll need an
>>> assertion that it is set. But having an assertion will still be internally
>>> inconsistent with the code linked above. So, it may be better to just fix
>>> it instead anyway.
>>
>> I finally got some time to look at the code, and I can add the below abort()
>> in case OVS decides to no longer mask out the frag bits by default.
>>
>> However, I have no clear idea how to make this work if we ever stop adding
>> this frag mask. I don't think we can report back to request a wider mask or
>> track what exists and force a revalidate.
>>
>> @@ -2447,6 +2479,8 @@ netdev_tc_flow_put(struct netdev *netdev, struct match
>> *match,
>> }
>>
>> mask->nw_frag = 0;
>> + } else {
>> + ovs_abort(0, "TC offload assumes nw_frag is always masked");
>> }
>
> We don't need to abort. We should just narrow down the match by adding
> frag bits to the mask here. Datapath implementations are allowed to
> narrow down the match, they are not allowed to make them wider.
I found a corner case where we do not set the frag mask correctly (it might be
a bug). It’s testcase 'datapath - ping between two ports on cvlan’, which
creates the following DP rule:
recirc_id(0),in_port(ovs-p0),eth(src=6e:48:c8:77:d3:8c,dst=12:f6:8b:52:f9:75),eth_type(0x88a8),vlan(vid=4094),encap(eth_type(0x0000),vlan(vid=100,pcp=0),encap(eth_type(0x0800)))
I tried setting the mask, mask->nw_frag = FLOW_NW_FRAG_MASK, but it will not
add the DP rule with the frag match.
I’ll try to dig into this a bit further as it might be a bug with the cvlan
code.
>>>>>> + && will_tc_add_l4_checksum(flower, type)) {
>>>>>> + VLOG_DBG_RL(&rl, "set action type %d not supported on fragments
>>>>>> "
>>>>>> + "due to checksum limitation", type);
>>>>>> + ofpbuf_uninit(&set_buf);
>>>>>> + return EOPNOTSUPP;
>>>>>> + }
>>>>>> +
>>>>>> for (i = 0; i < ARRAY_SIZE(set_flower_map[type]); i++) {
>>>>>> struct netlink_field *f = &set_flower_map[type][i];
>>>>>>
>>>>>> diff --git a/lib/tc.c b/lib/tc.c
>>>>>> index e9bcae4e4..20472d13c 100644
>>>>>> --- a/lib/tc.c
>>>>>> +++ b/lib/tc.c
>>>>>> @@ -2958,7 +2958,10 @@ csum_update_flag(struct tc_flower *flower,
>>>>>> * eth(dst=<mac>),eth_type(0x0800) actions=set(ipv4(src=<new_ip>))
>>>>>> * we need to force a more specific flow as this can, for example,
>>>>>> * need a recalculation of icmp checksum if the packet that passes
>>>>>> - * is ICMPv6 and tcp checksum if its tcp. */
>>>>>> + * is ICMPv6 and tcp checksum if its tcp.
>>>>>> + *
>>>>>> + * Ensure that you keep the pre-check function in
>>>>>> netdev-offload-tc.c,
>>>>>> + * will_tc_add_l4_checksum(), in sync if you make any changes here.
>>>>>> */
>>>>>>
>>>>>> switch (htype) {
>>>>>> case TCA_PEDIT_KEY_EX_HDR_TYPE_IP4:
>>>>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>>>>> index bd7647cbe..8f392abd5 100644
>>>>>> --- a/tests/system-traffic.at
>>>>>> +++ b/tests/system-traffic.at
>>>>>> @@ -8977,3 +8977,96 @@ OVS_WAIT_UNTIL([cat p2.pcap | grep -E "0x0050:
>>>>>> *0000 *0000 *5002 *2000 *b85e *00
>>>>>>
>>>>>> OVS_TRAFFIC_VSWITCHD_STOP
>>>>>> AT_CLEANUP
>>>>>> +
>>>>>> +AT_SETUP([datapath - mod_nw_src/set_field on IP fragments])
>>>>>> +AT_SKIP_IF([test $HAVE_TCPDUMP = no])
>>>>>> +OVS_TRAFFIC_VSWITCHD_START()
>>>>>> +
>>>>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>>>>> +
>>>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24", 36:b1:ee:7c:01:03)
>>>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24", 36:b1:ee:7c:01:02)
>>>>>> +
>>>>>> +AT_DATA([flows.txt], [dnl
>>>>>> + in_port=ovs-p0,ip,nw_src=10.1.1.1 actions=mod_nw_src=11.1.1.1,ovs-p1
>>>>>> + in_port=ovs-p0,ipv6,ipv6_src=fc00::1
>>>>>> actions=set_field:fc00::100->ipv6_src,ovs-p1
>>>>>> +])
>>>>>> +
>>>>>> +AT_CHECK([ovs-ofctl del-flows br0])
>>>>>> +AT_CHECK([ovs-ofctl -Oopenflow13 add-flows br0 flows.txt])
>>>>>> +
>>>>>> +NETNS_DAEMONIZE([at_ns1],
>>>>>> + [tcpdump -l -nn -xx -U -i p1 -w p1.pcap 2> tcpdump.err],
>>>>>> + [tcpdump.pid])
>>>>>> +OVS_WAIT_UNTIL([grep "listening" tcpdump.err])
>>>>>> +
>>>>>> +dnl IPv4 Packet content:
>>>>>> +dnl Ethernet II, Src: 36:b1:ee:7c:01:03, Dst: 36:b1:ee:7c:01:02
>>>>>> +dnl Type: IPv4 (0x0800)
>>>>>> +dnl Internet Protocol Version 4, Src: 10.1.1.1, Dst: 10.1.1.2
>>>>>> +dnl 0100 .... = Version: 4
>>>>>> +dnl .... 0101 = Header Length: 20 bytes (5)
>>>>>> +dnl Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
>>>>>> +dnl Total Length: 38
>>>>>> +dnl Identification: 0x0001 (1)
>>>>>> +dnl 001. .... = Flags: 0x1, More fragments
>>>>>> +dnl 0... .... = Reserved bit: Not set
>>>>>> +dnl .0.. .... = Don't fragment: Not set
>>>>>> +dnl ..1. .... = More fragments: Set
>>>>>> +dnl ...0 0000 0000 0000 = Fragment Offset: 0
>>>>>> +dnl Time to Live: 64
>>>>>> +dnl Protocol: UDP (17)
>>>>>> +dnl Header Checksum: 0x44c2
>>>>>> +dnl Data (18 bytes)
>>>>>> +eth="36 b1 ee 7c 01 02 36 b1 ee 7c 01 03 08 00"
>>>>>> +ip="45 00 00 26 00 01 20 00 40 11 44 c2 0a 01 01 01 0a 01 01 02"
>>>>>> +data="0b c4 08 84 00 26 e9 64 01 02 03 04 05 06 07 08 09 0a"
>>>>>> +packet="${eth} ${ip} ${data}"
>>>>>
>>>>> Since you're backporting the compose-packet functionality now, it's better
>>>>> if we use it here instead. We may want to get my sendpkt patch first:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>>
>>>>> And then follow a pattern from the other patches from the same set:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>>
>>>>> What do you think?
>>
>> I tried this, but no luck, as I cannot control the checksum needed on the UDP
>> side. It must include some missing data (i.e., the data in the next
>> fragment).
>
> It doesn't have to. compose-packet will just set the fragment bit,
> i.e. it'll be a full packet, but marked as a first fragment. It's
> technically allowed, IIUC. TC will bail on checksum re-calculation,
> while OVS kernel will update the existing one.
>
> And we don't need to guess the checksum, we can re-generate the packet
> after updating the fields.
The problem is that it will not catch the bug this way. TC will do a full
re-calculation, which will result in a valid checksum for UDP.
>> Also, the code lacks IPv6 fragment header support, and adding it fully might
>> not be trivial.
>
> I missed that. We're indeed missing support for generation of IPv6
> fragments.
>
>>
>> So, my suggestion is to keep it as is for now.
>>
>> Thoughts?
>
> I guess, it's fine to keep as-is for now, since we can't generate
> IPv6 fragments correctly.
>
> Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev