On 11/16/23 12:59, Dumitru Ceara wrote:
> On 11/16/23 12:49, Xavier Simonart wrote:
>> Hi Dumitru
>>
>
> Hi Xavier,
>
>> Thanks for the patch. It clearly decreases the failure count.
>> I am afraid however it does not fix all potential issues.
>> My understanding is the following:
>> The test goal was to ensure ovs does not disconnect the connections on
>> inactivity probe.There are 4 connections with inactivity probe (3 in
>> earlier OVN versions):
>> - feature
>> - pinctrl
>> - statctrl
>> - ofctrl
>> For those connections, ovs will send ECHO REQUEST and expects
>> ovn-controller to reply with ECHO REPLY, within 60 seconds.
>> The patch you sent ensures ofctrl runs between the warp, so ensures
>> that ovn-controller replied with ECHO REPLY to the ECHO_REQUEST for
>> ofctrl.
>> However, ECHO REQUEST for pinctrl connection is handled by the pinctrl
>> thread, and nothing guarantees that pinctrl had time to run between
>> two warps, so that it handled the ECHO REQUEST within those
>> (simulated) 60 seconds.
>> Similar issue happens with statctrl.
>
> I had missed those cases..
>
>> Then, another issue can occur with the "feature" connection:
>> ovn-controller handles the ECHO_REQUEST for that connection in the
>> main ovn-controller loop, but only if there is an ovs idl transaction
>> available ("ovs db is not read-only").
>> So, if ovsdb is very slow (and still handling the ovn-installed-ts
>> insertion for instance), ovn-controller will not get any ovs idl
>> transaction available, and won't handle the ECHO_REQUEST for the
>> "feature" connection, or at least not in time.
>> I am not sure how to fix those issues without making the test overly complex.
>>
>
> +1
>
> Also, the test itself is quite artificial, I suggest we remove it all
> together. If you agree with that, I can post v2.
>
It's probably easier to just discuss on the patch itself. I posted v2:
https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
Regards,
Dumitru
> Thanks,
> Dumitru
>
>> Thanks
>> Xavier
>>
>> On Thu, Nov 16, 2023 at 10:39 AM Ales Musil <[email protected]> wrote:
>>>
>>> On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <[email protected]> wrote:
>>>
>>>> The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
>>>> send an OpenFlow barrier to ovs-vswitchd. That doesn't happen unless
>>>> there are other OpenFlow (rule or group) changes that need to be
>>>> programmed in the datapath.
>>>>
>>>> Fix the test by actually installing new rules in between warp
>>>> invocations.
>>>>
>>>> Spotted in CI, mostly on oversubscribed systems. A log sample that
>>>> shows that ovn-controller didn't generate any barrier request for the
>>>> nbctl sync request:
>>>>
>>>> 2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
>>>> OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
>>>> 2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
>>>> OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
>>>> ...
>>>> 2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>> 2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
>>>> "warped"
>>>> 2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>> 2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>> 2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>> 2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>> 2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
>>>> "warped"
>>>> 2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>> 2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>> 2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>> 2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>>
>>>> Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
>>>> when running more than 120 seconds")
>>>> Signed-off-by: Dumitru Ceara <[email protected]>
>>>> ---
>>>> tests/ovn.at | 15 +++++++++++++++
>>>> 1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>>> index b8c61f87fb..face39e5c2 100644
>>>> --- a/tests/ovn.at
>>>> +++ b/tests/ovn.at
>>>> @@ -35352,20 +35352,35 @@ as hv1
>>>> check ovs-vsctl add-br br-phys
>>>> ovn_attach n1 br-phys 192.168.0.1
>>>>
>>>> +check ovn-nbctl ls-add ls
>>>> +
>>>> dnl Ensure that there are 4 openflow connections.
>>>> OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
>>>> hv1/ovs-vswitchd.log)" -eq "4"])
>>>>
>>>> +as hv1 check ovs-vsctl -- add-port br-int lsp0 \
>>>> + -- set Interface lsp0 external-ids:iface-id=lsp0 \
>>>> + -- add-port br-int lsp1 \
>>>> + -- set Interface lsp1 external-ids:iface-id=lsp1 \
>>>> + -- add-port br-int lsp2 \
>>>> + -- set Interface lsp2 external-ids:iface-id=lsp2
>>>> +
>>>> dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
>>>> dnl openflow connections in the meantime. This should allow ovs-vswitchd
>>>> dnl to probe the openflow connections at least twice.
>>>>
>>>> as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp0
>>>> +wait_for_ports_up
>>>> check ovn-nbctl --wait=hv sync
>>>>
>>>> as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp1
>>>> +wait_for_ports_up
>>>> check ovn-nbctl --wait=hv sync
>>>>
>>>> as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp2
>>>> +wait_for_ports_up
>>>> check ovn-nbctl --wait=hv sync
>>>>
>>>> AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
>>>> --
>>>> 2.39.3
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> [email protected]
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>>>
>>>
>>> Looks good to me, thanks.
>>>
>>> Acked-by: Ales Musil <[email protected]>
>>>
>>> --
>>>
>>> Ales Musil
>>>
>>> Senior Software Engineer - OVN Core
>>>
>>> Red Hat EMEA <https://www.redhat.com>
>>>
>>> [email protected]
>>> <https://red.ht/sig>
>>> _______________________________________________
>>> dev mailing list
>>> [email protected]
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev