On Mon, Feb 4, 2019 at 6:52 PM Ilya Maximets <[email protected]> wrote:

> On 04.02.2019 13:39, [email protected] wrote:
> > From: Numan Siddique <[email protected]>
> >
> > On a chassis, when ovs-vswitchd crashes for some reason, the BFD status
> > doesn't get updated in the OVS DB. ovn-controller keeps reading the old
> > BFD status even though ovs-vswitchd has crashed. This results in the
> > chassisredirect port claim flapping between the master chassis and the
> > chassis with the next higher priority if ovs-vswitchd crashes on the
> > master chassis.
> >
> > All the other chassis notice that the BFD status towards the master
> > chassis is down, and hence the chassis with the next higher priority
> > claims the port. But according to the master chassis, the BFD status is
> > fine, so it claims the chassisredirect port back. This results in
> > flapping. The issue gets resolved when ovs-vswitchd comes back, but
> > until then it leads to a lot of SB DB transactions and high CPU usage
> > in ovn-controller.
> >
> > This patch fixes the issue by checking the status of ovn-controller's
> > OpenFlow connection to ovs-vswitchd and calculating the active BFD
> > tunnels only if it is connected.
> >
> > Signed-off-by: Numan Siddique <[email protected]>
> > ---
> >
> > v1 -> v2
> > -----
> >  * Deleted unnecessary prints in the test case which were added during
> >    debugging.
> >
> >  ovn/controller/ofctrl.c         |  6 ++++++
> >  ovn/controller/ofctrl.h         |  3 +++
> >  ovn/controller/ovn-controller.c | 13 ++++++++++++-
> >  tests/ovn.at                    | 26 +++++++++++++++++++++++++-
> >  4 files changed, 46 insertions(+), 2 deletions(-)
> >
> > diff --git a/ovn/controller/ofctrl.c b/ovn/controller/ofctrl.c
> > index 218612787..95b95b607 100644
> > --- a/ovn/controller/ofctrl.c
> > +++ b/ovn/controller/ofctrl.c
> > @@ -1265,3 +1265,9 @@ ofctrl_inject_pkt(const struct ovsrec_bridge *br_int, const char *flow_s,
> >
> >      return NULL;
> >  }
> > +
> > +bool
> > +ofctrl_is_connected(void)
> > +{
> > +    return rconn_is_connected(swconn);
> > +}
> > diff --git a/ovn/controller/ofctrl.h b/ovn/controller/ofctrl.h
> > index ae0cfa513..f7521801b 100644
> > --- a/ovn/controller/ofctrl.h
> > +++ b/ovn/controller/ofctrl.h
> > @@ -60,4 +60,7 @@ void ofctrl_check_and_add_flow(struct hmap *desired_flows, uint8_t table_id,
> >                                 const struct match *,
> >                                 const struct ofpbuf *ofpacts,
> >                                 bool log_duplicate_flow);
> > +
> > +bool ofctrl_is_connected(void);
> > +
> >  #endif /* ovn/ofctrl.h */
> > diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c
> > index 4e9a5865f..2098f280c 100644
> > --- a/ovn/controller/ovn-controller.c
> > +++ b/ovn/controller/ovn-controller.c
> > @@ -689,7 +689,18 @@ main(int argc, char *argv[])
> >                      ovsrec_bridge_table_get(ovs_idl_loop.idl), br_int,
> >                      sbrec_chassis_table_get(ovnsb_idl_loop.idl), chassis_id,
> >                      sbrec_sb_global_first(ovnsb_idl_loop.idl));
> > -                bfd_calculate_active_tunnels(br_int, &active_tunnels);
> > +
> > +                if (ofctrl_is_connected()) {
> > +                    /* Calculate the active tunnels only if we have an
> > +                     * active OpenFlow connection to br-int.
> > +                     * If we don't have a connection to br-int, it could
> > +                     * mean ovs-vswitchd is down for some reason and the
> > +                     * BFD status in the Interface rows could be stale.
> > +                     * So it's better to consider the 'active_tunnels'
> > +                     * set to be empty if it's not connected. */
> > +                    bfd_calculate_active_tunnels(br_int, &active_tunnels);
> > +                }
> > +
> >                  binding_run(ovnsb_idl_txn, ovs_idl_txn, sbrec_chassis_by_name,
> >                              sbrec_datapath_binding_by_key,
> >                              sbrec_port_binding_by_datapath,
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index f54f24c74..fd558cb98 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -8055,7 +8055,31 @@ ovn-nbctl --timeout=3 --wait=hv \
> >
> >  test_ip_packet gw2 gw1
> >
> > -OVN_CLEANUP([hv1],[gw1],[gw2],[ext1])
> > +# Get the claim count of both gw1 and gw2.
> > +gw1_claim_ct=`grep "cr-alice: Claiming" gw1/ovn-controller.log | wc -l`
> > +gw2_claim_ct=`grep "cr-alice: Claiming" gw2/ovn-controller.log | wc -l`
> > +
> > +# Kill ovs-vswitchd in gw2. gw1 should claim the gateway port.
> > +kill `cat gw2/ovs-vswitchd.pid`
>
> Is it necessary to hard kill the daemon?
> The test works fine with OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) too.
>

I tried this but it didn't work for me. I will test again and update it.
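For context on the plain-vs-hard kill distinction raised above, here is a
self-contained sketch (the "daemon" and pidfile below are simulated, not OVS
itself): a daemon that traps SIGTERM gets a chance to run its exit handlers,
such as removing its pidfile, whereas SIGKILL would bypass the trap and leave
the pidfile behind.

```shell
#!/bin/sh
# Self-contained illustration (simulated daemon, not OVS): a plain
# `kill` sends SIGTERM, which the daemon traps to run cleanup such as
# removing its pidfile; `kill -9` would bypass the trap entirely.
pidfile=$(mktemp)

# "Daemon": writes its pidfile, then waits; on SIGTERM it cleans up.
(
  trap 'rm -f "$pidfile"; kill "$sleeper" 2>/dev/null; exit 0' TERM
  echo "$$" > "$pidfile"          # note: $$ is the parent's pid in a subshell
  sleep 30 >/dev/null 2>&1 &
  sleeper=$!
  wait "$sleeper"                 # interruptible wait so the trap fires promptly
) &
daemon=$!

sleep 1                           # give the "daemon" time to start
kill "$daemon"                    # plain kill -> TERM handler runs
wait "$daemon" 2>/dev/null

if [ -e "$pidfile" ]; then
    echo "pidfile still present"
else
    echo "pidfile removed"        # cleanup ran on the plain kill
fi
```

This is why the later `ovs-appctl ... exit` against a plainly-killed
ovs-vswitchd fails: the pidfile it needs is already gone.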


> Anyway, a simple 'kill' is not a hard kill. It's trapped by OVS, which
> executes some at_exit handlers. For example, the pid file is removed,
> so the 'appctl exit' command below always fails:
>
> 2019-02-04T13:06:58Z|00001|daemon_unix|WARN|/tests/testsuite.dir/2740/gw2/ovs-vswitchd.pid:
> open: No such file or directory
> ovs-appctl: cannot read pidfile
> "/tests/testsuite.dir/2740/gw2/ovs-vswitchd.pid" (No such file or directory)
>
> > +
> > +# gw1 should claim the cr-alice and the claim count of gw1 should be
> > +# incremented by 1.
> > +gw1_claim_ct=$((gw1_claim_ct+1))
> > +
> > +OVS_WAIT_UNTIL([test $gw1_claim_ct = `cat gw1/ovn-controller.log \
> > +| grep -c "cr-alice: Claiming"`])
> > +
> > +AT_CHECK([test $gw2_claim_ct = `cat gw2/ovn-controller.log | \
> > +grep -c "cr-alice: Claiming"`])
> > +
> > +test_ip_packet gw1 gw2
>
> The test hangs here for more than a minute because it tries to truncate
> the pcap files for the dead ovs-vswitchd:
>
> 2019-02-04T13:06:47Z|00002|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> /tests/testsuite.dir/at-groups/2740/test-source: line 140: 21249 Alarm
> clock
> ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap
> options:rxq_pcap=dummy-rx.pcap
> 2019-02-04T13:06:57Z|00002|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> /tests/testsuite.dir/at-groups/2740/test-source: line 140: 21264 Alarm
> clock
> ovs-vsctl -- set Interface $iface options:tx_pcap=${pcap_file}-tx.pcap
> options:rxq_pcap=${pcap_file}-rx.pcap
>
> After that 'utilities/ovs-pcap.in' fails too:
> ./ovn.at:8040: grep $expected packets | sort
> Traceback (most recent call last):
>   File "../../../utilities/ovs-pcap.in", line 99, in <module>
>     reader = PcapReader(args[0])
>   File "../../../utilities/ovs-pcap.in", line 31, in __init__
>     self.file = open(file_name, "rb")
> IOError: [Errno 2] No such file or directory: 'gw2/br-phys_n1-tx.pcap'
>
>
> These do not affect the test results, but they significantly increase the
> test time. Is it possible to rewrite the test to avoid waiting for the
> dead ovs-vswitchd?
>

It makes sense not to wait if ovs-vswitchd is dead. I will address it.
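One way to do that (a hypothetical guard, not code from the OVS tree; the
`vswitchd_alive` helper and the sandbox path are illustrative) is to check
whether the daemon is still running before touching its interfaces:

```shell
#!/bin/sh
# Hypothetical guard (not from the OVS tree; names are illustrative):
# skip the per-interface pcap truncation when the sandbox's
# ovs-vswitchd is no longer running, instead of letting each
# ovs-vsctl call time out against a dead daemon.
vswitchd_alive() {
    # $1: sandbox directory containing ovs-vswitchd.pid
    pid=$(cat "$1/ovs-vswitchd.pid" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

if vswitchd_alive gw2; then
    echo "vswitchd alive: resetting pcap options"
    # ovs-vsctl -- set Interface "$iface" options:tx_pcap=... as before
else
    echo "vswitchd dead: skipping pcap truncation"
fi
```

`kill -0` sends no signal; it only checks that the pid exists and is
signalable, so the guard is cheap and has no side effects.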


>
> > +
> > +as gw2
> > +ovs-appctl -t ovn-controller exit
> > +ovs-appctl -t ovs-vswitchd exit
>
> ovs-vswitchd should already be dead. Did you mean 'ovsdb-server'?
>

Yes, I meant ovsdb-server. Thanks.

Thanks for the review and comments. I will update v3 soon.

Numan


> Also, OVS_APP_EXIT_AND_WAIT should be here.
>
> > +
> > +OVN_CLEANUP([hv1],[gw1],[ext1])
> > +
> >  AT_CLEANUP
> >
> >  AT_SETUP([ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed
> router gateway port])
> >
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev