Hi Dumitru
Thanks for the patch. It clearly decreases the failure count.
I am afraid however it does not fix all potential issues.
My understanding is the following:
The test goal was to ensure ovs does not disconnect the connections on
inactivity probe.There are 4 connections with inactivity probe (3 in
earlier OVN versions):
- feature
- pinctrl
- statctrl
- ofctrl
For those connections, ovs will send ECHO REQUEST and expects
ovn-controller to reply with ECHO REPLY, within 60 seconds.
The patch you sent ensures ofctrl runs between the warp, so ensures
that ovn-controller replied with ECHO REPLY to the ECHO_REQUEST for
ofctrl.
However, ECHO REQUEST for pinctrl connection is handled by the pinctrl
thread, and nothing guarantees that pinctrl had time to run between
two warps, so that it handled the ECHO REQUEST within those
(simulated) 60 seconds.
Similar issue happens with statctrl.
Then, another issue can occur with the "feature" connection:
ovn-controller handles the ECHO_REQUEST for that connection in the
main ovn-controller loop, but only if there is an ovs idl transaction
available ("ovs db is not read-only").
So, if ovsdb is very slow (and still handling the ovn-installed-ts
insertion for instance), ovn-controller will not get any ovs idl
transaction available, and won't handle the ECHO_REQUEST for the
"feature" connection, or at least not in time.
I am not sure how to fix those issues without making the test overly complex.
Thanks
Xavier
On Thu, Nov 16, 2023 at 10:39 AM Ales Musil <[email protected]> wrote:
>
> On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <[email protected]> wrote:
>
> > The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
> > send an OpenFlow barrier to ovs-vswitchd. That doesn't happen unless
> > there are other OpenFlow (rule or group) changes that need to be
> > programmed in the datapath.
> >
> > Fix the test by actually installing new rules in between warp
> > invocations.
> >
> > Spotted in CI, mostly on oversubscribed systems. A log sample that
> > shows that ovn-controller didn't generate any barrier request for the
> > nbctl sync request:
> >
> > 2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
> > OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
> > 2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
> > OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
> > ...
> > 2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
> > time/warp["60000"], id=0
> > 2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
> > "warped"
> > 2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> > 2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> > 2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> > 2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
> > time/warp["60000"], id=0
> > 2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
> > "warped"
> > 2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
> > inactivity probe after 60 seconds, disconnecting
> > 2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
> > inactivity probe after 60 seconds, disconnecting
> > 2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
> > inactivity probe after 60 seconds, disconnecting
> > 2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
> > time/warp["60000"], id=0
> >
> > Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
> > when running more than 120 seconds")
> > Signed-off-by: Dumitru Ceara <[email protected]>
> > ---
> > tests/ovn.at | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index b8c61f87fb..face39e5c2 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -35352,20 +35352,35 @@ as hv1
> > check ovs-vsctl add-br br-phys
> > ovn_attach n1 br-phys 192.168.0.1
> >
> > +check ovn-nbctl ls-add ls
> > +
> > dnl Ensure that there are 4 openflow connections.
> > OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
> > hv1/ovs-vswitchd.log)" -eq "4"])
> >
> > +as hv1 check ovs-vsctl -- add-port br-int lsp0 \
> > + -- set Interface lsp0 external-ids:iface-id=lsp0 \
> > + -- add-port br-int lsp1 \
> > + -- set Interface lsp1 external-ids:iface-id=lsp1 \
> > + -- add-port br-int lsp2 \
> > + -- set Interface lsp2 external-ids:iface-id=lsp2
> > +
> > dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
> > dnl openflow connections in the meantime. This should allow ovs-vswitchd
> > dnl to probe the openflow connections at least twice.
> >
> > as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp0
> > +wait_for_ports_up
> > check ovn-nbctl --wait=hv sync
> >
> > as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp1
> > +wait_for_ports_up
> > check ovn-nbctl --wait=hv sync
> >
> > as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp2
> > +wait_for_ports_up
> > check ovn-nbctl --wait=hv sync
> >
> > AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
> > --
> > 2.39.3
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> >
>
> Looks good to me, thanks.
>
> Acked-by: Ales Musil <[email protected]>
>
> --
>
> Ales Musil
>
> Senior Software Engineer - OVN Core
>
> Red Hat EMEA <https://www.redhat.com>
>
> [email protected]
> <https://red.ht/sig>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev