On Mon, Feb 4, 2019 at 6:52 PM Ilya Maximets <[email protected]> wrote:
> On 04.02.2019 13:39, [email protected] wrote:
> > From: Numan Siddique <[email protected]>
> >
> > On a chassis where ovs-vswitchd crashes for some reason, the BFD status
> > doesn't get updated in the ovs db. ovn-controller keeps reading the old
> > BFD status even though ovs-vswitchd has crashed. This results in the
> > chassisredirect port claim flapping between the master chassis and the
> > chassis with the next higher priority if ovs-vswitchd crashes on the
> > master chassis.
> >
> > All the other chassis notice the BFD status going down for the master
> > chassis, and hence the chassis with the next higher priority claims the
> > port. But according to the master chassis the BFD status is fine, so it
> > claims the chassisredirect port back, and this results in flapping. The
> > issue gets resolved when ovs-vswitchd comes back, but until then it leads
> > to a lot of SB DB transactions and high CPU usage in the ovn-controllers.
> >
> > This patch fixes the issue by checking the OpenFlow connection status of
> > ovn-controller with ovs-vswitchd and calculating the active BFD tunnels
> > only if it is connected.
> >
> > Signed-off-by: Numan Siddique <[email protected]>
> > ---
> >
> > v1 -> v2
> > -----
> > * Deleted unnecessary prints in the test case which were added during
> >   debugging.
> >
> >  ovn/controller/ofctrl.c         |  6 ++++++
> >  ovn/controller/ofctrl.h         |  3 +++
> >  ovn/controller/ovn-controller.c | 13 ++++++++++++-
> >  tests/ovn.at                    | 26 +++++++++++++++++++++++++-
> >  4 files changed, 46 insertions(+), 2 deletions(-)
> >
> > diff --git a/ovn/controller/ofctrl.c b/ovn/controller/ofctrl.c
> > index 218612787..95b95b607 100644
> > --- a/ovn/controller/ofctrl.c
> > +++ b/ovn/controller/ofctrl.c
> > @@ -1265,3 +1265,9 @@ ofctrl_inject_pkt(const struct ovsrec_bridge *br_int, const char *flow_s,
> >
> >      return NULL;
> >  }
> > +
> > +bool
> > +ofctrl_is_connected(void)
> > +{
> > +    return rconn_is_connected(swconn);
> > +}
> > diff --git a/ovn/controller/ofctrl.h b/ovn/controller/ofctrl.h
> > index ae0cfa513..f7521801b 100644
> > --- a/ovn/controller/ofctrl.h
> > +++ b/ovn/controller/ofctrl.h
> > @@ -60,4 +60,7 @@ void ofctrl_check_and_add_flow(struct hmap *desired_flows, uint8_t table_id,
> >                                 const struct match *,
> >                                 const struct ofpbuf *ofpacts,
> >                                 bool log_duplicate_flow);
> > +
> > +bool ofctrl_is_connected(void);
> > +
> >  #endif /* ovn/ofctrl.h */
> > diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c
> > index 4e9a5865f..2098f280c 100644
> > --- a/ovn/controller/ovn-controller.c
> > +++ b/ovn/controller/ovn-controller.c
> > @@ -689,7 +689,18 @@ main(int argc, char *argv[])
> >              ovsrec_bridge_table_get(ovs_idl_loop.idl), br_int,
> >              sbrec_chassis_table_get(ovnsb_idl_loop.idl), chassis_id,
> >              sbrec_sb_global_first(ovnsb_idl_loop.idl));
> > -        bfd_calculate_active_tunnels(br_int, &active_tunnels);
> > +
> > +        if (ofctrl_is_connected()) {
> > +            /* Calculate the active tunnels only if we have an active
> > +             * OpenFlow connection to br-int.
> > +             * If we don't have a connection to br-int, it could mean
> > +             * ovs-vswitchd is down for some reason and the BFD status
> > +             * in the Interface rows could be stale. So it's better to
> > +             * consider the 'active_tunnels' set to be empty if it's not
> > +             * connected. */
> > +            bfd_calculate_active_tunnels(br_int, &active_tunnels);
> > +        }
> > +
> >          binding_run(ovnsb_idl_txn, ovs_idl_txn, sbrec_chassis_by_name,
> >                      sbrec_datapath_binding_by_key,
> >                      sbrec_port_binding_by_datapath,
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index f54f24c74..fd558cb98 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -8055,7 +8055,31 @@ ovn-nbctl --timeout=3 --wait=hv \
> >
> >  test_ip_packet gw2 gw1
> >
> > -OVN_CLEANUP([hv1],[gw1],[gw2],[ext1])
> > +# Get the claim count of both gw1 and gw2.
> > +gw1_claim_ct=`grep "cr-alice: Claiming" gw1/ovn-controller.log | wc -l`
> > +gw2_claim_ct=`grep "cr-alice: Claiming" gw2/ovn-controller.log | wc -l`
> > +
> > +# Kill ovs-vswitchd in gw2. gw1 should claim the gateway port.
> > +kill `cat gw2/ovs-vswitchd.pid`
>
> Is it necessary to hard kill the daemon?
> The test works fine with OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) too.
>

I tried this but it didn't work for me. I will test again and update it.

> Anyway, a simple 'kill' is not a hard kill. It's trapped by OVS, and
> OVS executes some at_exit handlers. For example, the pid file is removed,
> so the 'appctl exit' command below always fails:
>
> 2019-02-04T13:06:58Z|00001|daemon_unix|WARN|/tests/testsuite.dir/2740/gw2/ovs-vswitchd.pid: open: No such file or directory
> ovs-appctl: cannot read pidfile "/tests/testsuite.dir/2740/gw2/ovs-vswitchd.pid" (No such file or directory)
>
> > +
> > +# gw1 should claim cr-alice and the claim count of gw1 should be
> > +# incremented by 1.
> > +gw1_claim_ct=$((gw1_claim_ct+1))
> > +
> > +OVS_WAIT_UNTIL([test $gw1_claim_ct = `cat gw1/ovn-controller.log \
> > +| grep -c "cr-alice: Claiming"`])
> > +
> > +AT_CHECK([test $gw2_claim_ct = `cat gw2/ovn-controller.log | \
> > +grep -c "cr-alice: Claiming"`])
> > +
> > +test_ip_packet gw1 gw2
>
> The test hangs here for more than a minute because it tries to truncate
> the pcap files for the dead ovs-vswitchd:
>
> 2019-02-04T13:06:47Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
> /tests/testsuite.dir/at-groups/2740/test-source: line 140: 21249 Alarm clock   ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap options:rxq_pcap=dummy-rx.pcap
> 2019-02-04T13:06:57Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
> /tests/testsuite.dir/at-groups/2740/test-source: line 140: 21264 Alarm clock   ovs-vsctl -- set Interface $iface options:tx_pcap=${pcap_file}-tx.pcap options:rxq_pcap=${pcap_file}-rx.pcap
>
> After that 'utilities/ovs-pcap.in' fails too:
> ./ovn.at:8040: grep $expected packets | sort
> Traceback (most recent call last):
>   File "../../../utilities/ovs-pcap.in", line 99, in <module>
>     reader = PcapReader(args[0])
>   File "../../../utilities/ovs-pcap.in", line 31, in __init__
>     self.file = open(file_name, "rb")
> IOError: [Errno 2] No such file or directory: 'gw2/br-phys_n1-tx.pcap'
>
> These do not affect the test results, but they significantly affect the test time.
> Is it possible to rewrite the test to avoid waiting for the dead ovs-vswitchd?
>

It makes sense to not wait if ovs-vswitchd is dead. I will address it.

> > +
> > +as gw2
> > +ovs-appctl -t ovn-controller exit
> > +ovs-appctl -t ovs-vswitchd exit
>
> ovs-vswitchd should already be dead. Did you mean 'ovsdb-server'?

Yes, I meant ovsdb-server. Thanks.

Thanks for the review and comments. I will update v3 soon.

Numan

> Also, OVS_APP_EXIT_AND_WAIT should be here.
> > +
> > +OVN_CLEANUP([hv1],[gw1],[ext1])
> > +
> >  AT_CLEANUP
> >
> >  AT_SETUP([ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port])

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
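For reference, the test change Ilya is suggesting would look roughly like the sketch below. It is only a sketch against the test fragment quoted above: it assumes OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) can replace the raw 'kill' and that gw1 still takes over the port after a graceful exit, which is exactly what Numan says he will re-verify for v3. OVS_APP_EXIT_AND_WAIT asks the daemon to exit through ovs-appctl and waits for the process to go away, so the stale-pidfile warning quoted above should not appear.

    # Stop ovs-vswitchd on gw2 gracefully; gw1 should then claim the
    # gateway port once the BFD session towards gw2 goes down.
    as gw2
    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])

    # gw1 should claim cr-alice, so its claim count goes up by one.
    gw1_claim_ct=$((gw1_claim_ct+1))
    OVS_WAIT_UNTIL([test $gw1_claim_ct = `cat gw1/ovn-controller.log \
    | grep -c "cr-alice: Claiming"`])

    # gw2 must not re-claim the port while its ovs-vswitchd is down.
    AT_CHECK([test $gw2_claim_ct = `cat gw2/ovn-controller.log | \
    grep -c "cr-alice: Claiming"`])

Whether the later test_ip_packet and cleanup steps also need to stop touching gw2's interfaces (to avoid the pcap truncation hang) is left to the v3 rework discussed above.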
