On 6/27/24 15:03, Eelco Chaudron wrote:
>
>
> On 4 Jun 2024, at 13:52, Ilya Maximets wrote:
>
>> On 6/4/24 13:42, Eelco Chaudron wrote:
>>>
>>>
>>> On 1 Jun 2024, at 0:08, Ilya Maximets wrote:
>>>
>>>> On 5/7/24 15:52, Eelco Chaudron wrote:
>>>>> While offloading header modifications to TC, OVS is using {TCA_PEDIT} +
>>>>> {TCA_CSUM} combination as that it the only way to represent header
>>>>> rewrite. However, {TCA_CSUM} is unable to calculate L4 checksums for
>>>>> IP fragments.
>>>>>
>>>>> Since TC already applies fragmentation bit masking, this patch simply
>>>>> needs to prevent these packets from being processed through TC.
>>>>>
>>>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>>>> ---
>>>>> v2: - Fixed and added some comments.
>>>>> - Use ovs-pcap to compare packets.
>>>>>
>>>>> NOTE: This patch needs an AVX512 fix before it can be applied.
>>>>> Intel is working on this.
>>>>> ---
>>>>> lib/netdev-offload-tc.c | 32 ++++++++++++++
>>>>> lib/tc.c | 5 ++-
>>>>> tests/system-traffic.at | 93 +++++++++++++++++++++++++++++++++++++++++
>>>>> 3 files changed, 129 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
>>>>> index 921d52317..bdd307933 100644
>>>>> --- a/lib/netdev-offload-tc.c
>>>>> +++ b/lib/netdev-offload-tc.c
>>>>> @@ -1488,6 +1488,30 @@ parse_put_flow_ct_action(struct tc_flower *flower,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static bool
>>>>> +will_tc_add_l4_checksum(struct tc_flower *flower, int type)
>>>>> +{
>>>>> + /* This function returns true if the tc layer will add a l4 checksum
>>>>> action
>>>>> + * for this set action. Refer to the csum_update_flag() function for
>>>>> + * detailed logic. Note that even the kernel only supports updating
>>>>> TCP,
>>>>> + * UDP and ICMPv6. */
>>>>
>>>> This comment should be outside of the function, I think. It's strange
>>>> to have it here. I see csum_update_flag() has a comment inside, but
>>>> that's strange as well. That function has other style issues as well,
>>>> there is no need to copy them.
>>>
>>> ACK, will clean this up in the next rev.
>>>
>>>>> + switch (type) {
>>>>> + case OVS_KEY_ATTR_IPV4:
>>>>> + case OVS_KEY_ATTR_IPV6:
>>>>> + case OVS_KEY_ATTR_TCP:
>>>>> + case OVS_KEY_ATTR_UDP:
>>>>> + switch (flower->key.ip_proto) {
>>>>> + case IPPROTO_TCP:
>>>>> + case IPPROTO_UDP:
>>>>> + case IPPROTO_ICMPV6:
>>>>> + case IPPROTO_UDPLITE:
>>>>> + return true;
>>>>> + }
>>>>> + break;
>>>>> + }
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> static int
>>>>> parse_put_flow_set_masked_action(struct tc_flower *flower,
>>>>> struct tc_action *action,
>>>>> @@ -1520,6 +1544,14 @@ parse_put_flow_set_masked_action(struct tc_flower
>>>>> *flower,
>>>>> return EOPNOTSUPP;
>>>>> }
>>>>>
>>>>> + if (flower->key.flags & TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT
>>>>
>>>> We're using this field to make an offloading decision. We must
>>>> also set in the mask. If for some reason we're not matching on
>>>> the fragment bits, we may first receive a non-fragmented packet
>>>> and offload it, then fragmented traffic may match and fail the
>>>> checksumming. So, we need to enable the bit in the mask.
>>>
>>> The dp always matches on the fragment bit for IPv4 packets, I did some
>>> tests with this.
>>> So if we sent a non-fragment packet first the dp rule will match on
>>> fragment = 0. Or
>>> did I miss something?
>>
>> It is true today, but nothing ensures that on the netdev-offload-tc level.
>> Moreover, the netdev-offload-tc is written in a way that it expects the
>> frag bits to potentially not be in the mask:
>>
>> https://github.com/openvswitch/ovs/blob/1d681ffe3b208a0db4945b6389142ab18404a4d1/lib/netdev-offload-tc.c#L2432-L2448
>> So, it is going to be internally inconsistent if we do not set the bit
>> explicitly.
>>
>> And if someday we'll stop adding the frag bit, we'll never know if we
>> forget to set it in netdev-offload-tc. At the very least we'll need an
>> assertion that it is set. But having an assertion will still be internally
>> inconsistent with the code linked above. So, it may be better to just fix
>> it instead anyway.
>
> I finally got some time to look at the code, and I can add the below abort()
> in case OVS decides to no longer mask out the frag bits by default.
>
> However, I have no clear idea how to make this work if we ever stop adding
> this frag mask. I don't think we can report back to request a wider mask or
> track what exists and force a revalidate.
>
> @@ -2447,6 +2479,8 @@ netdev_tc_flow_put(struct netdev *netdev, struct match
> *match,
> }
>
> mask->nw_frag = 0;
> + } else {
> + ovs_abort(0, "TC offload assumes nw_frag is always masked");
> }
We don't need to abort. We should just narrow down the match by adding
frag bits to the mask here. Datapath implementations are allowed to
narrow down the match, they are not allowed to make them wider.
>
>>>>> + && will_tc_add_l4_checksum(flower, type)) {
>>>>> + VLOG_DBG_RL(&rl, "set action type %d not supported on fragments "
>>>>> + "due to checksum limitation", type);
>>>>> + ofpbuf_uninit(&set_buf);
>>>>> + return EOPNOTSUPP;
>>>>> + }
>>>>> +
>>>>> for (i = 0; i < ARRAY_SIZE(set_flower_map[type]); i++) {
>>>>> struct netlink_field *f = &set_flower_map[type][i];
>>>>>
>>>>> diff --git a/lib/tc.c b/lib/tc.c
>>>>> index e9bcae4e4..20472d13c 100644
>>>>> --- a/lib/tc.c
>>>>> +++ b/lib/tc.c
>>>>> @@ -2958,7 +2958,10 @@ csum_update_flag(struct tc_flower *flower,
>>>>> * eth(dst=<mac>),eth_type(0x0800) actions=set(ipv4(src=<new_ip>))
>>>>> * we need to force a more specific flow as this can, for example,
>>>>> * need a recalculation of icmp checksum if the packet that passes
>>>>> - * is ICMPv6 and tcp checksum if its tcp. */
>>>>> + * is ICMPv6 and tcp checksum if its tcp.
>>>>> + *
>>>>> + * Ensure that you keep the pre-check function in
>>>>> netdev-offload-tc.c,
>>>>> + * will_tc_add_l4_checksum(), in sync if you make any changes here.
>>>>> */
>>>>>
>>>>> switch (htype) {
>>>>> case TCA_PEDIT_KEY_EX_HDR_TYPE_IP4:
>>>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>>>> index bd7647cbe..8f392abd5 100644
>>>>> --- a/tests/system-traffic.at
>>>>> +++ b/tests/system-traffic.at
>>>>> @@ -8977,3 +8977,96 @@ OVS_WAIT_UNTIL([cat p2.pcap | grep -E "0x0050:
>>>>> *0000 *0000 *5002 *2000 *b85e *00
>>>>>
>>>>> OVS_TRAFFIC_VSWITCHD_STOP
>>>>> AT_CLEANUP
>>>>> +
>>>>> +AT_SETUP([datapath - mod_nw_src/set_field on IP fragments])
>>>>> +AT_SKIP_IF([test $HAVE_TCPDUMP = no])
>>>>> +OVS_TRAFFIC_VSWITCHD_START()
>>>>> +
>>>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>>>> +
>>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24", 36:b1:ee:7c:01:03)
>>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24", 36:b1:ee:7c:01:02)
>>>>> +
>>>>> +AT_DATA([flows.txt], [dnl
>>>>> + in_port=ovs-p0,ip,nw_src=10.1.1.1 actions=mod_nw_src=11.1.1.1,ovs-p1
>>>>> + in_port=ovs-p0,ipv6,ipv6_src=fc00::1
>>>>> actions=set_field:fc00::100->ipv6_src,ovs-p1
>>>>> +])
>>>>> +
>>>>> +AT_CHECK([ovs-ofctl del-flows br0])
>>>>> +AT_CHECK([ovs-ofctl -Oopenflow13 add-flows br0 flows.txt])
>>>>> +
>>>>> +NETNS_DAEMONIZE([at_ns1],
>>>>> + [tcpdump -l -nn -xx -U -i p1 -w p1.pcap 2> tcpdump.err],
>>>>> + [tcpdump.pid])
>>>>> +OVS_WAIT_UNTIL([grep "listening" tcpdump.err])
>>>>> +
>>>>> +dnl IPv4 Packet content:
>>>>> +dnl Ethernet II, Src: 36:b1:ee:7c:01:03, Dst: 36:b1:ee:7c:01:02
>>>>> +dnl Type: IPv4 (0x0800)
>>>>> +dnl Internet Protocol Version 4, Src: 10.1.1.1, Dst: 10.1.1.2
>>>>> +dnl 0100 .... = Version: 4
>>>>> +dnl .... 0101 = Header Length: 20 bytes (5)
>>>>> +dnl Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
>>>>> +dnl Total Length: 38
>>>>> +dnl Identification: 0x0001 (1)
>>>>> +dnl 001. .... = Flags: 0x1, More fragments
>>>>> +dnl 0... .... = Reserved bit: Not set
>>>>> +dnl .0.. .... = Don't fragment: Not set
>>>>> +dnl ..1. .... = More fragments: Set
>>>>> +dnl ...0 0000 0000 0000 = Fragment Offset: 0
>>>>> +dnl Time to Live: 64
>>>>> +dnl Protocol: UDP (17)
>>>>> +dnl Header Checksum: 0x44c2
>>>>> +dnl Data (18 bytes)
>>>>> +eth="36 b1 ee 7c 01 02 36 b1 ee 7c 01 03 08 00"
>>>>> +ip="45 00 00 26 00 01 20 00 40 11 44 c2 0a 01 01 01 0a 01 01 02"
>>>>> +data="0b c4 08 84 00 26 e9 64 01 02 03 04 05 06 07 08 09 0a"
>>>>> +packet="${eth} ${ip} ${data}"
>>>>
>>>> Since you're backporting the compose-packet functionality now, it's better
>>>> if we use it here instead. We may want to get my sendpkt patch first:
>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>
>>>> And then follow a pattern from the other patches from the same set:
>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>>>
>>>> What do you think?
>
> I tried this, but no luck, as I cannot control the checksum needed on the UDP
> side. It must include some missing data (i.e., the data in the next fragment).
It doesn't have to. compose-packet will just set the fragment bit,
i.e. it'll be a full packet, but marked as a first fragment. It's
technically allowed, IIUC. TC will bail on checksum re-calculation,
while OVS kernel will update the existing one.
And we don't need to guess the checksum, we can re-generate the packet
after updating the fields.
> Also, the code lacks IPv6 fragment header support, and adding it fully might
> not be trivial.
I missed that. We're indeed missing support for generation of IPv6
fragments.
>
> So, my suggestion is to keep it as is for now.
>
> Thoughts?
I guess, it's fine to keep as-is for now, since we can't generate
IPv6 fragments correctly.
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev