On 5/31/22 21:15, Frode Nordahl wrote:
> On Mon, May 30, 2022 at 5:25 PM Frode Nordahl
> <frode.nord...@canonical.com> wrote:
>>
>> On Fri, May 27, 2022 at 10:04 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> On 5/26/22 14:53, Frode Nordahl wrote:
>>>>
>>>>
>>>> On Thu, May 26, 2022 at 14:45, Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>
>>>>     On 5/26/22 13:00, Frode Nordahl wrote:
>>>>     > On Wed, May 25, 2022 at 9:55 AM Frode Nordahl
>>>>     > <frode.nord...@canonical.com> wrote:
>>>>     >>
>>>>     >> On Tue, May 24, 2022 at 1:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>     >>>
>>>>     >>> On 5/24/22 12:54, Frode Nordahl wrote:
>>>>     >>>> On Mon, May 23, 2022 at 3:49 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>     >>>>>
>>>>     >>>>> On 5/21/22 12:49, Frode Nordahl wrote:
>>>>     >>>>>> On Thu, May 19, 2022 at 3:39 PM Frode Nordahl
>>>>     >>>>>> <frode.nord...@canonical.com> wrote:
>>>>     >>>>>>>
>>>>     >>>>>>> On Sat, May 14, 2022 at 2:10 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>     >>>>>>>>
>>>>     >>>>>>>> On 5/13/22 10:36, Frode Nordahl wrote:
>>>>     >>>>>>>>> On Fri, Mar 11, 2022 at 2:04 PM Liam Young <liam.yo...@canonical.com> wrote:
>>>>     >>>>>>>>>>
>>>>     >>>>>>>>>> Hi,
>>>>     >>>>>>>>>>
>>>>     >>>>>>>>>> <tl;dr> Commit 355fef6f2 seems to break connectivity in my 
>>>> setup</tl;dr>
>>>>     >>>>>>>>>
>>>>     >>>>>>>>> After OVN started colocating sNAT and dNAT operations in the 
>>>> same CT
>>>>     >>>>>>>>> zone [0] the above mentioned OVS commit appears to also 
>>>> break OVN [1].
>>>>     >>>>>>>>> I have been discussing with Numan and he has found a 
>>>> correlation with
>>>>     >>>>>>>>> the behavior of the Open vSwitch kernel datapath conntrack
>>>>     >>>>>>>>> `skb_nfct_cached` function, i.e. if that is changed to 
>>>> always return
>>>>     >>>>>>>>> 'false' the erratic behavior disappears.
>>>>     >>>>>>>>>
>>>>     >>>>>>>>> So I guess the question then becomes, is this an OVS 
>>>> userspace or OVS
>>>>     >>>>>>>>> kernel problem?
>>>>     >>>>>>>>>
>>>>     >>>>>>>>> We have a reproducer in [3].
>>>>     >>>>>>>>>
>>>>     >>>>>>>>> 0: https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a
>>>>     >>>>>>>>> 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
>>>>     >>>>>>>>> 2: https://elixir.bootlin.com/linux/latest/source/net/openvswitch/conntrack.c#L683
>>>>     >>>>>>>>> 3: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6
>>>>     >>>>>>>>>
>>>>     >>>>>>>>
>>>>     >>>>>>>> Hrm.  I think there is a logical bug in the implementations of 
>>>> ct()
>>>>     >>>>>>>> datapath action in both datapaths.
>>>>     >>>>>>>>
>>>>     >>>>>>>> The problem appears when the OpenFlow pipeline for the packet 
>>>> contains
>>>>     >>>>>>>> several ct() actions.  NXAST_CT action (i.e. every ct() 
>>>> action) must
>>>>     >>>>>>>> always put the packet into an untracked state.  Tracked state 
>>>> will be
>>>>     >>>>>>>> set only in the 'recirc_table' where the packet is cloned by 
>>>> the ct()
>>>>     >>>>>>>> action for further processing.
>>>>     >>>>>>>>
>>>>     >>>>>>>> If an OF pipeline looks like this:
>>>>     >>>>>>>>
>>>>     >>>>>>>>   actions=ct(),something,something,ct(),something
>>>>     >>>>>>>>
>>>>     >>>>>>>> each action must be entered with the 'untracked' conntrack 
>>>> state in
>>>>     >>>>>>>> the packet metadata.
>>>>     >>>>>>>>
>>>>     >>>>>>>> However, ct() action in the datapath, unlike OpenFlow, 
>>>> doesn't work
>>>>     >>>>>>>> this way.  It modifies the conntrack state for the packet it 
>>>> is processing.
>>>>     >>>>>>>> During the flow translation OVS inserts recirculation right 
>>>> after the
>>>>     >>>>>>>> datapath ct() action.  This way conntrack state affects only 
>>>> processing
>>>>     >>>>>>>> after recirculation.  Sort of.
>>>>     >>>>>>>>
>>>>     >>>>>>>> The issue is that not all OpenFlow ct() actions have 
>>>> recirc_table
>>>>     >>>>>>>> specified.  These actions are supposed to change the state of the 
>>>> conntrack
>>>>     >>>>>>>> subsystem, but they still must leave the packet itself in the 
>>>> untracked
>>>>     >>>>>>>> state with no strings attached.  But since datapath ct() 
>>>> actions don't
>>>>     >>>>>>>> work that way and since recirculation is not inserted (no 
>>>> recirc_table
>>>>     >>>>>>>> specified), packet after conntrack execution carries the 
>>>> state along to
>>>>     >>>>>>>> all other actions.
>>>>     >>>>>>>> This doesn't impact normal actions, but seems to break 
>>>> subsequent ct()
>>>>     >>>>>>>> actions on the same pipeline.
>>>>     >>>>>>>>
>>>>     >>>>>>>> In general, it looks to me that ct() action in the datapath is
>>>>     >>>>>>>> not supposed to cache any connection data inside the packet's 
>>>> metadata.
>>>>     >>>>>>>> This seems to be the root cause of the problem.  Fields that 
>>>> OF knows
>>>>     >>>>>>>> about should not trigger any issues if carried along with a 
>>>> packet,
>>>>     >>>>>>>> but datapath-specific metadata can not be cleared, because 
>>>> OFproto
>>>>     >>>>>>>> simply doesn't know about it.
>>>>     >>>>>>>>
>>>>     >>>>>>>> But, I guess, not caching the connection and other internal 
>>>> structures
>>>>     >>>>>>>> might significantly impact datapath performance.  Will it?
>>>>     >>>>>>>>
>>>>     >>>>>>>> Looking at the reproducer in [3], it has, for example, the 
>>>> flow with the
>>>>     >>>>>>>> following actions:
>>>>     >>>>>>>>
>>>>     >>>>>>>>   actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
>>>>     >>>>>>>>           
>>>> set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
>>>>     >>>>>>>>           set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
>>>>     >>>>>>>>           ct(zone=8),recirc(0x4)
>>>>     >>>>>>>>
>>>>     >>>>>>>> So, if the first ct() changes some datapath-specific 
>>>> packet metadata,
>>>>     >>>>>>>> the second ct() may try to use that information.
>>>>     >>>>>>>>
>>>>     >>>>>>>> It looks like during the flow translation we must add 
>>>> ct_clear action
>>>>     >>>>>>>> explicitly after every ct() action, unless it was the last 
>>>> action in
>>>>     >>>>>>>> the list.  This will end up with adding a lot of ct_clear 
>>>> actions
>>>>     >>>>>>>> everywhere.
>>>>     >>>>>>>>
>>>>     >>>>>>>> Another option is the patch below that tries to track if 
>>>> ct_clear
>>>>     >>>>>>>> is required and adds that action before the next ct() action 
>>>> in
>>>>     >>>>>>>> the datapath.
>>>>     >>>>>>>>
>>>>     >>>>>>>> BTW, the test [3] fails with both kernel and userspace 
>>>> datapaths.
>>>>     >>>>>>>>
>>>>     >>>>>>>> The change below should fix the problem, I think.
>>>>     >>>>>>>> It looks like we also have to put ct_clear action before 
>>>> translating
>>>>     >>>>>>>> output to the patch port if we're in 'conntracked' state.
>>>>     >>>>>>>>
>>>>     >>>>>>>> And I do not know how to fix the problem for kernels without 
>>>> ct_clear
>>>>     >>>>>>>> support.  I don't think we can clear metadata that is 
>>>> internal to
>>>>     >>>>>>>> the kernel.  Ideas are welcome.
>>>>     >>>>>>>>
>>>>     >>>>>>>> CC: Aaron, Paolo.
>>>>     >>>>>>>>
>>>>     >>>>>>>> Any thoughts?
>>>>     >>>>>>>
>>>>     >>>>>>> Thank you so much for the detailed explanation of what's going 
>>>> on and
>>>>     >>>>>>> for providing a proposed fix.
>>>>     >>>>>>>
>>>>     >>>>>>> I took it for a spin and it does indeed appear to fix the OVN 
>>>> hairpin
>>>>     >>>>>>> issue, but it does unfortunately not appear to fix the issue 
>>>> Liam
>>>>     >>>>>>> reported for the ML2/OVS use case [4].
>>>>     >>>>>>>
>>>>     >>>>>>> When trying that use case with your patch I see this in the 
>>>> Open vSwitch log:
>>>>     >>>>>>> 
>>>> 2022-05-19T08:17:02.668Z|00001|ofproto_dpif_xlate(handler1)|WARN|over
>>>>     >>>>>>> max translation depth 64 on bridge br-int while processing
>>>>     >>>>>>> 
>>>> ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=3,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0
>>>>     >>>>>>> 
>>>> 2022-05-19T08:17:02.668Z|00002|ofproto_dpif_upcall(handler1)|WARN|Dropped
>>>>     >>>>>>> 2 log messages in last 205 seconds (most recently, 187 seconds 
>>>> ago)
>>>>     >>>>>>> due to excessive rate
>>>>     >>>>>>> 
>>>> 2022-05-19T08:17:02.668Z|00003|ofproto_dpif_upcall(handler1)|WARN|Flow:
>>>>     >>>>>>> 
>>>> ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=5,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0
>>>>     >>>>>>>
>>>>     >>>>>>> bridge("br-int")
>>>>     >>>>>>> ----------------
>>>>     >>>>>>>  0. priority 0, cookie 0x56d6d58c018b51bd
>>>>     >>>>>>>     goto_table:60
>>>>     >>>>>>> 60. priority 3, cookie 0x56d6d58c018b51bd
>>>>     >>>>>>>     NORMAL
>>>>     >>>>>>>      >>>> received packet on unknown port 5 <<<<
>>>>     >>>>>>>      >> no input bundle, dropping
>>>>     >>>>>>>
>>>>     >>>>>>> Final flow: unchanged
>>>>     >>>>>>> Megaflow: 
>>>> recirc_id=0,eth,ipv6,in_port=5,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,nw_frag=no
>>>>     >>>>>>> Datapath actions: drop
>>>>     >>>>>>>
>>>>     >>>>>>>
>>>>     >>>>>>> On the back of your explanation and the log output I made an 
>>>> attempt
>>>>     >>>>>>> at adding this patch on top of yours:
>>>>     >>>>>>> ---
>>>>     >>>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
>>>> b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>>> index 110dab0ec..2955e8e4d 100644
>>>>     >>>>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>>> @@ -7144,6 +7144,7 @@ do_xlate_actions(const struct ofpact 
>>>> *ofpacts,
>>>>     >>>>>>> size_t ofpacts_len,
>>>>     >>>>>>>           * resubmit to the frozen actions.
>>>>     >>>>>>>           */
>>>>     >>>>>>>          case OFPACT_RESUBMIT:
>>>>     >>>>>>> +            ctx->pending_ct_clear = true;
>>>>     >>>>>>>              xlate_ofpact_resubmit(ctx, 
>>>> ofpact_get_RESUBMIT(a), last);
>>>>     >>>>>>>              continue;
>>>>     >>>>>>>          case OFPACT_GOTO_TABLE:
>>>>     >>>>>>> ---
>>>>     >>>>>>>
>>>>     >>>>>>> And the two patches together do appear to resolve the issue 
>>>> reported
>>>>     >>>>>>> in [4] as well as the OVN hairpin issue [1]. It does however 
>>>> make a
>>>>     >>>>>>> couple of tests fail, so I need to look into whether that is 
>>>> expected from
>>>>     >>>>>>> the change or if the approach must be changed.
>>>>     >>>>>>>
>>>>     >>>>>>> 4: https://bugs.launchpad.net/openvswitch/+bug/1964117
>>>>     >>>>>>
>>>>     >>>>>> The following patch also works and does not cause any tests to 
>>>> fail.
>>>>     >>>>>>
>>>>     >>>>>> ---
>>>>     >>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
>>>> b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>> index 110dab0ec..905f6994d 100644
>>>>     >>>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>>> @@ -7354,7 +7354,9 @@ do_xlate_actions(const struct ofpact 
>>>> *ofpacts,
>>>>     >>>>>> size_t ofpacts_len,
>>>>     >>>>>>              break;
>>>>     >>>>>>
>>>>     >>>>>>          case OFPACT_CT_CLEAR:
>>>>     >>>>>> -            if (ctx->conntracked || ctx->pending_ct_clear) {
>>>>     >>>>>> +            if (ctx->conntracked || ctx->pending_ct_clear
>>>>     >>>>>> +                || (flow->ct_state && flow->ct_state & 
>>>> CS_TRACKED))
>>>>     >>>>>> +            {
>>>>     >>>>>>                  compose_ct_clear_action(ctx);
>>>>     >>>>>>              }
>>>>     >>>>>>              break;
>>>>     >>>>>> ---
>>>>     >>>>>>
>>>>     >>>>>> The OpenStack ML2/OVS Open vSwitch firewall driver appears to 
>>>> make
>>>>     >>>>>> explicit use of ct_clear action as part of progressing packets 
>>>> through
>>>>     >>>>>> the tables and the optimization in 355fef6f2 interrupts that 
>>>> workflow.
>>>>     >>>>>>
>>>>     >>>>>> I am not quite sure if the definition of `struct
>>>>     >>>>>> xlate_ctx->conntracked` is "this flow has been exposed to 
>>>> conntrack"
>>>>     >>>>>> or if it is "conntrack sees this flow as established"; if 
>>>> anyone can
>>>>     >>>>>> shed light on that we would know if the above patch is ok as is 
>>>> or if
>>>>     >>>>>> this flow state should have set struct xlate_ctx->conntracked 
>>>> to true
>>>>     >>>>>> in the first place.
>>>>     >>>>>
>>>>     >>>>> I believe that ctx->conntracked should mean that "this flow has 
>>>> been exposed
>>>>     >>>>> to conntrack".  And it looks like the condition:
>>>>     >>>>>  (!ctx->conntracked && flow->ct_state & CS_TRACKED)
>>>>     >>>>> should not be possible.  We're, probably, missing 
>>>> clear_conntrack() call
>>>>     >>>>> somewhere.
>>>>     >>>>>
>>>>     >>>>> One thing I seem to miss in my patch is the conntrack clear 
>>>> after returning
>>>>     >>>>> from the patch port processing:
>>>>     >>>>>
>>>>     >>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
>>>> b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>> index 110dab0ec..53d4f78b2 100644
>>>>     >>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>>> @@ -3955,8 +3955,13 @@ patch_port_output(struct xlate_ctx *ctx, 
>>>> const struct xport *in_dev,
>>>>     >>>>>
>>>>     >>>>>      /* The out bridge's conntrack execution should have no 
>>>> effect on the
>>>>     >>>>>       * original bridge. */
>>>>     >>>>> -    ctx->conntracked = old_conntrack;
>>>>     >>>>>      ctx->pending_ct_clear = old_ct_clear;
>>>>     >>>>> +    if (ctx->conntracked) {
>>>>     >>>>> +        /* Conntrack was involved in the other bridge.  We need 
>>>> to clear
>>>>     >>>>> +         * whatever information was cached in the datapath. */
>>>>     >>>>> +        ctx->pending_ct_clear = true;
>>>>     >>>>> +    }
>>>>     >>>>> +    ctx->conntracked = old_conntrack;
>>>>     >>>>>
>>>>     >>>>>      /* The fact that the out bridge exits (for any reason) does 
>>>> not mean
>>>>     >>>>>       * that the original bridge should exit.  Specifically, if 
>>>> the out
>>>>     >>>>> ---
>>>>     >>>>>
>>>>     >>>>> But that doesn't seem to be related to the (!ctx->conntracked &&
>>>>     >>>>> flow->ct_state & CS_TRACKED) case...
>>>>     >>>>
>>>>     >>>> Thank you for confirming that ctx->conntracked should have been 
>>>> set in
>>>>     >>>> this flow state; that led me to the possible root cause and fix of 
>>>> this
>>>>     >>>> particular use case.
>>>>     >>>>
>>>>     >>>> Enabling debug logging for the dpif module while initiating a new flow, I 
>>>> see
>>>>     >>>> that the flow can enter the system as a miss upcall without
>>>>     >>>> recirculation and with ct_state set:
>>>>     >>>> 
>>>> 2022-05-24T10:42:57.869Z|00036|dpif(handler6)|DBG|system@ovs-system:
>>>>     >>>> miss upcall:
>>>>     >>>> 
>>>> recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0x21),ct_zone(0),ct_mark(0),ct_label(0),ct_tuple6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,proto=58,src_port=128,dst_port=0),eth(src=fa:16:3e:81:22:4d,dst=fa:16:3e:e5:e7:b2),eth_type(0x86dd),ipv6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,label=0x357f1,proto=58,tclass=0,hlimit=64,frag=no),icmpv6(type=128,code=0)
>>>>     >>>>
>>>>     >>>> Looking at the code it appears that for this state the
>>>>     >>>> ctx->conntracked would not be set; it is currently initialized to
>>>>     >>>> 'false' and only updated based on frozen state for recirculated
>>>>     >>>> packets.
>>>>     >>>>
>>>>     >>>> Adding this patch resolves the inconsistency we discussed above and
>>>>     >>>> subsequently the ML2/OVS problem:
>>>>     >>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
>>>> b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>> index 110dab0ec..7bc7426ac 100644
>>>>     >>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>     >>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>     >>>> @@ -7852,6 +7852,12 @@ xlate_actions(struct xlate_in *xin, struct
>>>>     >>>> xlate_out *xout)
>>>>     >>>>          goto exit;
>>>>     >>>>      }
>>>>     >>>>
>>>>     >>>> +    if (!xin->frozen_state
>>>>     >>>> +        && xin->flow.ct_state
>>>>     >>>> +        && xin->flow.ct_state & CS_TRACKED) {
>>>>     >>>> +        ctx.conntracked = true;
>>>>     >>>> +    }
>>>>     >>>> +
>>>>     >>>>      /* Tunnel metadata in udpif format must be normalized before
>>>>     >>>> translation. */
>>>>     >>>>      if (flow->tunnel.flags & FLOW_TNL_F_UDPIF) {
>>>>     >>>>          const struct tun_table *tun_tab = ofproto_get_tun_tab(
>>>>     >>>> ---
>>>>     >>>>
>>>>     >>>> What do you think?
>>>
>>> I was ready to send the kernel patch below, but when I tried to run
>>> system tests, 2 of them failed.  And then I discovered this gem:
>>>
>>>   c2926d6d1cfd ("system-traffic: Add ct tests using local stack.")
>>
>> Good catch! I only tested your proposed kernel change with the OVN
>> system tests and not the OVS ones.
>>
>>> So, it might be that your change to set the 'conntracked' to 'true'
>>> is actually the right one, since OVS system tests are expecting
>>> this scenario to work...
>>
>> Ok, I've found a usable system test for this condition, so I'm ready
>> to submit something for this bit:
>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>> index 239105e89..53af3d3da 100644
>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -6807,6 +6807,46 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep table=2, | OFPROTO_CLEAR_DURATION_IDLE
>>  OVS_TRAFFIC_VSWITCHD_STOP
>>  AT_CLEANUP
>>
>> +dnl This is essentially a copy of the "datapath - ping over geneve tunnel"
>> +dnl test, and we use it to confirm OVS behavior when ct_state is set on
>> +dnl flows by the kernel without conscious action by the OVS user space code.
>> +AT_SETUP([conntrack - ct_state from outside OVS])
>> +OVS_CHECK_TUNNEL_TSO()
>> +OVS_CHECK_GENEVE()
>> +
>> +OVS_TRAFFIC_VSWITCHD_START()
>> +ADD_BR([br-underlay])
>> +
>> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>> +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=100,ct_state=+trk,actions=ct_clear,resubmit(,0)"])
>> +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=10,actions=normal"])
>> +
>> +ADD_NAMESPACES(at_ns0)
>> +
>> +dnl Set up underlay link from host into the namespace using veth pair.
>> +ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
>> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
>> +AT_CHECK([ip link set dev br-underlay up])
>> +
>> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
>> +dnl linux device inside the namespace.
>> +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
>> +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
>> +                  [vni 0])
>> +
>> +dnl First, check the underlay
>> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
>> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
>> +])
>> +
>> +dnl Okay, now check the overlay
>> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
>> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
>> +])
>> +
>> +OVS_TRAFFIC_VSWITCHD_STOP
>> +AT_CLEANUP
>> +
>>  AT_BANNER([IGMP])
>>
>>  AT_SETUP([IGMP - flood under normal action])
>> ---
>>
>> I should probably add something to confirm the ct_state=+trk flow rule
>> is hit so this does not suddenly turn into a false positive.
>>
>>
>> The ct state across patch ports issue is a bit more involved to test,
>> so I wonder if we should use an OVN test for that? We could of course
>> have OVN stage it and see if it is easy to extract something to encode
>> into OVS system tests.
>>
>>> I'm starting to question reality here already... Will get back
>>> to that issue next week.
>>
>> Friday afternoon can have that effect on the best of us :-)
> 
> I've pushed the first part of the fix here:
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/394450.html

Thanks!  I saw that and I tend to think that it is correct.
I'll try to test it and apply it in the next couple of days.

One question about the test above: which entity actually adds
the ct_state to the packet, and at what point does that happen?
I see that it does, but I'm not sure I fully understand it.  Looks
like I'm missing something obvious.

> 
> For the patch port issue: Our previous discussion about the disconnect
> between how the datapath operates and how OF operates combined with
> the fact that everyone appears to currently issue a `ct_clear` action
> after a change of datapath made me think. Perhaps the issue is that OVS
> currently only clears its internal state and does not do anything with
> the datapath state when "cloning" the flow over to the new
> bridge/datapath?
> 
> With the first part of the fix in place, this is sufficient to fix the
> OVN hairpin issue:
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 578cbfe58..cc29425b0 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -7285,6 +7285,9 @@ do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
>          }
> 
>          case OFPACT_CT:
> +            if (ctx->conntracked) {
> +                compose_ct_clear_action(ctx);
> +            }
>              compose_conntrack_action(ctx, ofpact_get_CT(a), last);
>              break;
> 
> ---

This seems to be a bit of an overkill, and I'm also not sure
it actually covers all the cases, i.e. after a ct() action
that doesn't require recirculation, 'conntracked' will be false
while the ct_state may still be populated.
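
For illustration (a made-up pipeline; table numbers and the zone
are arbitrary):

  table=0, ip, actions=ct(commit,zone=1),resubmit(,1)
  table=1, ip, actions=ct(zone=1,table=2)

The first ct() has no recirc_table, so 'conntracked' stays false
during translation and the check above wouldn't fire before the
second ct(), even though in the datapath the packet already
carries tracked state into it.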

> 
> Which makes me ask, perhaps we should just compose a ct_clear action
> at the same time as the call to `clear_conntrack(ctx);` in the
> `patch_port_output` function? Does of course feel more risky than the
> delayed approach you proposed, but if it does not work without an
> explicit ct_clear action anyway, it becomes tempting.
> 

Yeah.  In general, I agree that we kind of need to emit the
datapath action every time we do clear_conntrack() in userspace.
But at the same time I don't think we can actually do that.
The main problem, I guess, is that CT_CLEAR is not really
reversible, regardless of what reversible_actions() tells you. :/
(this is another thing we need to fix, I suppose).
So, if we add ct_clear before the patch port output, we can
revert the change in userspace and still think that we have
some ct_state while translating actions further, but that
state will be lost in the datapath, and we can't recover it
in the datapath without passing the packet through conntrack
again.  So, the ct_clear should be emitted after the packet
has been cloned for patch port processing.  I'm not sure what
would be a good way to implement that.
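
Conceptually, the datapath actions we'd want are something like
(just a sketch; I haven't checked whether the translation actually
wraps patch port processing in a clone() here):

  clone(ct_clear, <actions of the peer bridge>), <rest of the original bridge>

i.e. the ct_clear lives inside the clone used for the patch port,
so the peer bridge starts untracked while the original bridge keeps
whatever ct state it had in the datapath.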

What bothers me the most is this construction:

    /* True if conntrack has been performed on this packet during processing
     * on the current bridge. This is used to determine whether conntrack
     * state from the datapath should be honored after recirculation. */
    bool conntracked;

    ....
    if (xin->frozen_state) {
        ...
        if (!state->conntracked) {
            clear_conntrack(&ctx);
        }

"whether conntrack state from the datapath should be honored
after recirculation"...  This is part of the original design
and it looks like the 'conntracked' field is an attempt to
fix the same issue that we're trying to fix now, i.e. that
after the ct() action, the packet that didn't recirculate must
be in the untracked state.  The way this was approached though
is by clearing the conntrack *after* the non-ct recirculation.

And when the datapath action for ct_clear was introduced we
got another problem with subsequent ct() actions even without
recirculation.

I'm just getting to my initial conclusion, which was:
"""
It looks like during the flow translation we must add ct_clear action
explicitly after every ct() action, unless it was the last action in
the list.  This will end up with adding a lot of ct_clear actions
everywhere.
"""

i.e. to add compose_ct_clear_action(ctx); to the end of
compose_conntrack_action(), maybe checking for is_last_action.
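
Something along these lines, as a rough sketch (not compile-tested;
I added a guard for the recirc_table case, since with a recirc table
the remaining work happens after recirculation anyway, and the exact
placement at the end of compose_conntrack_action() would need care):

    /* At the end of compose_conntrack_action(), roughly: */
    if (ofc->recirc_table == NX_CT_RECIRC_NONE && !is_last_action) {
        /* OpenFlow ct() without a recirc table must still leave the
         * packet untracked for the rest of the action list, so clear
         * whatever state the datapath ct() cached in the packet. */
        compose_ct_clear_action(ctx);
    }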

That should solve all the problems..., right?

Well, it won't solve the following case:

table=0, ip, actions=ct(table=8)
table=8, ip,ct_state=+trk, actions=set_field:"new ip"->ip_dst,ct(...)

Here the conntrack state is legit, but the tuple changed so
the cached connection is not valid for the skb anymore.

For this, your change to always clear conntrack if 'conntracked'
will help, but it may not be a good solution from the performance
point of view, e.g. we could still use the cached connection if
the tuple didn't change.

For this we could introduce another flag to track whether any tuple
changes occurred since the last ct() action and emit the ct_clear
only in such cases.  Sounds horrifying though. :)
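
Roughly (the field name is made up), that would be a bool in
struct xlate_ctx, set wherever the translation rewrites L3/L4
header fields, and checked when composing the next ct():

    /* While translating set-field/set-ip/set-port actions that
     * touch the 5-tuple: */
    ctx->tuple_changed = true;

    /* When composing the next ct() action: */
    if (ctx->conntracked && ctx->tuple_changed) {
        compose_ct_clear_action(ctx);
    }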

Another approach is to actually fix the datapath to determine if
the cache is valid or not with something like this (not even
compile-tested):

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 4a947c13c813..2d2eec8879a9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -685,6 +685,7 @@ static bool skb_nfct_cached(struct net *net,
                            const struct ovs_conntrack_info *info,
                            struct sk_buff *skb)
 {
+       struct nf_conntrack_tuple tuple;
        enum ip_conntrack_info ctinfo;
        struct nf_conn *ct;
        bool ct_executed = true;
@@ -702,6 +703,14 @@ static bool skb_nfct_cached(struct net *net,
                return false;
        if (!nf_ct_zone_equal_any(info->ct, nf_ct_zone(ct)))
                return false;
+
+       if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb), info->family,
+                              net, &tuple))
+               return false;
+
+       if (!nf_ct_tuple_equal(&tuple, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple))
+               return false;
+
        if (info->helper) {
                struct nf_conn_help *help;
 
---

Still, I'm not sure if my code is correct or if the tuple
extraction and comparison is a performance black hole that
will defeat the purpose of the cache.

We could also set a flag in OVS_CB(skb) from
set_{ipv4,ipv6,tcp,udp,sctp} if values changed and check
that flag inside skb_nfct_cached().  Clear it on successful
ovs_ct_execute().  That should be much more lightweight if
performance is a concern.
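
As a rough sketch of that direction (not compile-tested, and the
flag name is made up):

    /* datapath.h: add a flag to the per-skb control block. */
    struct ovs_skb_cb {
            ...
            bool    tuple_changed;  /* L3/L4 fields rewritten since
                                     * the last conntrack pass. */
    };

    /* actions.c: in set_ipv4()/set_ipv6()/set_tcp()/set_udp()/set_sctp(),
     * after a header field is actually rewritten: */
    OVS_CB(skb)->tuple_changed = true;

    /* conntrack.c, skb_nfct_cached(): treat the cached entry as stale. */
    if (OVS_CB(skb)->tuple_changed)
            return false;

    /* conntrack.c, ovs_ct_execute(): reset once conntrack has processed
     * the packet with the new tuple. */
    OVS_CB(skb)->tuple_changed = false;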

Sorry for the lengthy inconclusive reply again. :)

Best regards, Ilya Maximets.