Ali, sorry if I misunderstand what you are saying, but pacemaker here is for northd HA. Pacemaker itself won't point to any ovsdb cluster node. All northds can point to an LB VIP for the ovsdb cluster, so if a member of the ovsdb cluster goes down it has no impact on northd.
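As a rough sketch (the VIP 10.0.0.100 below is hypothetical; only the --ovnnb-db/--ovnsb-db options come from the commands quoted further down), each northd instance would be started as:

# Hypothetical LB VIP 10.0.0.100 fronting the three ovsdb-server cluster members;
# every northd (active or standby) points at the VIP rather than a specific member:
ovn-northd --ovnnb-db=tcp:10.0.0.100:6641 --ovnsb-db=tcp:10.0.0.100:6642 \
    --pidfile --detach --log-file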
Without clustering support of the ovsdb lock, I think this is what we have now for northd HA. Please suggest if anyone has any other idea. Thanks :) On Wed, Mar 21, 2018 at 1:12 PM, aginwala <aginw...@asu.edu> wrote: > :) The only thing is while using pacemaker, if the node that pacemaker if > pointing to is down, all the active/standby northd nodes have to be updated > to new node from the cluster. But will dig in more to see what else I can > find. > > @Ben: Any suggestions further? > > > Regards, > > On Wed, Mar 21, 2018 at 10:22 AM, Han Zhou <zhou...@gmail.com> wrote: > >> >> >> On Wed, Mar 21, 2018 at 9:49 AM, aginwala <aginw...@asu.edu> wrote: >> >>> Thanks Numan: >>> >>> Yup agree with the locking part. For now; yes I am running northd on one >>> node. I might right a script to monitor northd in cluster so that if the >>> node where it's running goes down, script can spin up northd on one other >>> active nodes as a dirty hack. >>> >>> The "dirty hack" is pacemaker :) >> >> >>> Sure, will await for the inputs from Ben too on this and see how complex >>> would it be to roll out this feature. >>> >>> >>> Regards, >>> >>> >>> On Wed, Mar 21, 2018 at 5:43 AM, Numan Siddique <nusid...@redhat.com> >>> wrote: >>> >>>> Hi Aliasgar, >>>> >>>> ovsdb-server maintains locks per each connection and not across the db. >>>> A workaround for you now would be to configure all the ovn-northd instances >>>> to connect to one ovsdb-server if you want to have active/standy. >>>> >>>> Probably Ben can answer if there is a plan to support ovsdb locks >>>> across the db. We also need this support in networking-ovn as it also uses >>>> ovsdb locks. >>>> >>>> Thanks >>>> Numan >>>> >>>> >>>> On Wed, Mar 21, 2018 at 1:40 PM, aginwala <aginw...@asu.edu> wrote: >>>> >>>>> Hi Numan: >>>>> >>>>> Just figured out that ovn-northd is running as active on all 3 nodes >>>>> instead of one active instance as I continued to test further which >>>>> results >>>>> in db errors as per logs. >>>>> >>>>> >>>>> # on node 3, I run ovn-nbctl ls-add ls2 ; it populates below logs in >>>>> ovn-north >>>>> 2018-03-21T06:01:59.442Z|00007|ovsdb_idl|WARN|transaction error: >>>>> {"details":"Transaction causes multiple rows in \"Datapath_Binding\" table >>>>> to have identical values (1) for index on column \"tunnel_key\". First >>>>> row, with UUID 8c5d9342-2b90-4229-8ea1-001a733a915c, was inserted by >>>>> this transaction. Second row, with UUID >>>>> 8e06f919-4cc7-4ffc-9a79-20ce6663b683, >>>>> existed in the database before this transaction and was not modified by >>>>> the >>>>> transaction.","error":"constraint violation"} >>>>> >>>>> In southbound datapath list, 2 duplicate records gets created for same >>>>> switch. 
>>>>> >>>>> # ovn-sbctl list Datapath >>>>> _uuid : b270ae30-3458-445f-95d2-b14e8ebddd01 >>>>> external_ids : >>>>> {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d", >>>>> name="ls2"} >>>>> tunnel_key : 2 >>>>> >>>>> _uuid : 8e06f919-4cc7-4ffc-9a79-20ce6663b683 >>>>> external_ids : >>>>> {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d", >>>>> name="ls2"} >>>>> tunnel_key : 1 >>>>> >>>>> >>>>> >>>>> # on nodes 1 and 2 where northd is running, it gives below error: >>>>> 2018-03-21T06:01:59.437Z|00008|ovsdb_idl|WARN|transaction error: >>>>> {"details":"cannot delete Datapath_Binding row >>>>> 8e06f919-4cc7-4ffc-9a79-20ce6663b683 because of 17 remaining >>>>> reference(s)","error":"referential integrity violation"} >>>>> >>>>> As per commit message, for northd I re-tried setting --ovnnb-db="tcp: >>>>> 10.169.125.152:6641,tcp:10.169.125.131:6641,tcp:10.148.181.162:6641" >>>>> and --ovnsb-db="tcp:10.169.125.152:6642,tcp:10.169.125.131:6642,tcp: >>>>> 10.148.181.162:6642" and it did not help either. >>>>> >>>>> There is no issue if I keep running only one instance of northd on any >>>>> of these 3 nodes. Hence, wanted to know is there something else >>>>> missing here to make only one northd instance as active and rest as >>>>> standby? >>>>> >>>>> >>>>> Regards, >>>>> >>>>> On Thu, Mar 15, 2018 at 3:09 AM, Numan Siddique <nusid...@redhat.com> >>>>> wrote: >>>>> >>>>>> That's great >>>>>> >>>>>> Numan >>>>>> >>>>>> >>>>>> On Thu, Mar 15, 2018 at 2:57 AM, aginwala <aginw...@asu.edu> wrote: >>>>>> >>>>>>> Hi Numan: >>>>>>> >>>>>>> I tried on new nodes (kernel : 4.4.0-104-generic , Ubuntu 16.04)with >>>>>>> fresh installation and it worked super fine for both sb and nb dbs. >>>>>>> Seems >>>>>>> like some kernel issue on the previous nodes when I re-installed raft >>>>>>> patch >>>>>>> as I was running different ovs version on those nodes before. >>>>>>> >>>>>>> >>>>>>> For 2 HVs, I now set ovn-remote="tcp:10.169.125.152:6642, tcp: >>>>>>> 10.169.125.131:6642, tcp:10.148.181.162:6642" and started >>>>>>> controller and it works super fine. >>>>>>> >>>>>>> >>>>>>> Did some failover testing by rebooting/killing the leader ( >>>>>>> 10.169.125.152) and bringing it back up and it works as expected. >>>>>>> Nothing weird noted so far. >>>>>>> >>>>>>> # check-cluster gives below data one of the node(10.148.181.162) post >>>>>>> leader failure >>>>>>> >>>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db >>>>>>> ovsdb-tool: leader /etc/openvswitch/ovnsb_db.db for term 2 has log >>>>>>> entries only up to index 18446744073709551615, but index 9 was >>>>>>> committed in >>>>>>> a previous term (e.g. by /etc/openvswitch/ovnsb_db.db) >>>>>>> >>>>>>> >>>>>>> For check-cluster, are we planning to add more output showing which >>>>>>> node is active(leader), etc in upcoming versions ? >>>>>>> >>>>>>> >>>>>>> Thanks a ton for helping sort this out. I think the patch looks >>>>>>> good to be merged post addressing of the comments by Justin along with >>>>>>> the >>>>>>> man page details for ovsdb-tool. >>>>>>> >>>>>>> >>>>>>> I will do some more crash testing for the cluster along with the >>>>>>> scale test and keep you posted if something unexpected is noted. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Mar 13, 2018 at 11:07 PM, Numan Siddique < >>>>>>> nusid...@redhat.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Mar 14, 2018 at 7:51 AM, aginwala <aginw...@asu.edu> wrote: >>>>>>>> >>>>>>>>> Sure. 
>>>>>>>>> >>>>>>>>> To add on , I also ran for nb db too using different port and >>>>>>>>> Node2 crashes with same error : >>>>>>>>> # Node 2 >>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-nb-addr=10.99.152.138 >>>>>>>>> --db-nb-port=6641 --db-nb-cluster-remote-addr="tcp: >>>>>>>>> 10.99.152.148:6645" --db-nb-cluster-local-addr="tcp: >>>>>>>>> 10.99.152.138:6645" start_nb_ovsdb >>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnnb_db.db: cannot >>>>>>>>> identify file type >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Hi Aliasgar, >>>>>>>> >>>>>>>> It worked for me. Can you delete the old db files in >>>>>>>> /etc/openvswitch/ and try running the commands again ? >>>>>>>> >>>>>>>> Below are the commands I ran in my setup. >>>>>>>> >>>>>>>> Node 1 >>>>>>>> ------- >>>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>> --db-sb-addr=192.168.121.91 --db-sb-port=6642 >>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>> --db-sb-cluster-local-addr=tcp:192.168.121.91:6644 start_sb_ovsdb >>>>>>>> >>>>>>>> Node 2 >>>>>>>> --------- >>>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>> --db-sb-addr=192.168.121.87 --db-sb-port=6642 >>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.87:6644" >>>>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644" >>>>>>>> start_sb_ovsdb >>>>>>>> >>>>>>>> Node 3 >>>>>>>> --------- >>>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>> --db-sb-addr=192.168.121.78 --db-sb-port=6642 >>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.78:6644" >>>>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644" >>>>>>>> start_sb_ovsdb >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> Numan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Mar 13, 2018 at 9:40 AM, Numan Siddique < >>>>>>>>> nusid...@redhat.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Mar 13, 2018 at 9:46 PM, aginwala <aginw...@asu.edu> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks Numan for the response. >>>>>>>>>>> >>>>>>>>>>> There is no command start_cluster_sb_ovsdb in the source code >>>>>>>>>>> too. Is that in a separate commit somewhere? Hence, I used >>>>>>>>>>> start_sb_ovsdb >>>>>>>>>>> which I think would not be a right choice? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Sorry, I meant start_sb_ovsdb. Strange that it didn't work for >>>>>>>>>> you. Let me try it out again and update this thread. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Numan >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> # Node1 came up as expected. >>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=" >>>>>>>>>>> tcp:10.99.152.148:6644" start_sb_ovsdb. >>>>>>>>>>> >>>>>>>>>>> # verifying its a clustered db with ovsdb-tool db-local-address >>>>>>>>>>> /etc/openvswitch/ovnsb_db.db >>>>>>>>>>> tcp:10.99.152.148:6644 >>>>>>>>>>> # ovn-sbctl show works fine and chassis are being populated >>>>>>>>>>> correctly. 
>>>>>>>>>>> >>>>>>>>>>> #Node 2 fails with error: >>>>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>>>>> --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>>> start_sb_ovsdb >>>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot >>>>>>>>>>> identify file type >>>>>>>>>>> >>>>>>>>>>> # So i did start the sb db the usual way using start_ovsdb to >>>>>>>>>>> just get the db file created and killed the sb pid and re-ran the >>>>>>>>>>> command >>>>>>>>>>> which gave actual error where it complains for join-cluster command >>>>>>>>>>> that is >>>>>>>>>>> being called internally >>>>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>>>>> --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>>> start_sb_ovsdb >>>>>>>>>>> ovsdb-tool: /etc/openvswitch/ovnsb_db.db: not a clustered >>>>>>>>>>> database >>>>>>>>>>> * Backing up database to /etc/openvswitch/ovnsb_db.db.b >>>>>>>>>>> ackup1.15.0-70426956 >>>>>>>>>>> ovsdb-tool: 'join-cluster' command requires at least 4 arguments >>>>>>>>>>> * Creating cluster database /etc/openvswitch/ovnsb_db.db from >>>>>>>>>>> existing one >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> # based on above error I killed the sb db pid again and try to >>>>>>>>>>> create a local cluster on node then re-ran the join operation as >>>>>>>>>>> per the >>>>>>>>>>> source code function. >>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6644 tcp:10.99.152.148:6644 >>>>>>>>>>> which still complains >>>>>>>>>>> ovsdb-tool: I/O error: /etc/openvswitch/ovnsb_db.db: create >>>>>>>>>>> failed (File exists) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> # Node 3: I did not try as I am assuming the same failure as >>>>>>>>>>> node 2 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Let me know may know further. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Mar 13, 2018 at 3:08 AM, Numan Siddique < >>>>>>>>>>> nusid...@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Aliasgar, >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 13, 2018 at 7:11 AM, aginwala <aginw...@asu.edu> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Ben/Noman: >>>>>>>>>>>>> >>>>>>>>>>>>> I am trying to setup 3 node southbound db cluster using >>>>>>>>>>>>> raft10 <https://patchwork.ozlabs.org/patch/854298/> in >>>>>>>>>>>>> review. >>>>>>>>>>>>> >>>>>>>>>>>>> # Node 1 create-cluster >>>>>>>>>>>>> ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>>> /root/ovs-reviews/ovn/ovn-sb.ovsschema tcp:10.99.152.148:6642 >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> A different port is used for RAFT. So you have to choose >>>>>>>>>>>> another port like 6644 for example. 
>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> # Node 2 >>>>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6642 tcp:10.99.152.148:6642 --cid >>>>>>>>>>>>> 5dfcb678-bb1d-4377-b02d-a380edec2982 >>>>>>>>>>>>> >>>>>>>>>>>>> #Node 3 >>>>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>>> OVN_Southbound tcp:10.99.152.101:6642 tcp:10.99.152.138:6642 >>>>>>>>>>>>> tcp:10.99.152.148:6642 --cid 5dfcb678-bb1d-4377-b02d-a380ed >>>>>>>>>>>>> ec2982 >>>>>>>>>>>>> >>>>>>>>>>>>> # ovn remote is set to all 3 nodes >>>>>>>>>>>>> external_ids:ovn-remote="tcp:10.99.152.148:6642, tcp: >>>>>>>>>>>>> 10.99.152.138:6642, tcp:10.99.152.101:6642" >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> # Starting sb db on node 1 using below command on node 1: >>>>>>>>>>>>> >>>>>>>>>>>>> ovsdb-server --detach --monitor -vconsole:off -vraft -vjsonrpc >>>>>>>>>>>>> --log-file=/var/log/openvswitch/ovsdb-server-sb.log >>>>>>>>>>>>> --pidfile=/var/run/openvswitch/ovnsb_db.pid >>>>>>>>>>>>> --remote=db:OVN_Southbound,SB_Global,connections >>>>>>>>>>>>> --unixctl=ovnsb_db.ctl >>>>>>>>>>>>> --private-key=db:OVN_Southbound,SSL,private_key >>>>>>>>>>>>> --certificate=db:OVN_Southbound,SSL,certificate >>>>>>>>>>>>> --ca-cert=db:OVN_Southbound,SSL,ca_cert >>>>>>>>>>>>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols >>>>>>>>>>>>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers >>>>>>>>>>>>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock >>>>>>>>>>>>> /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>>> >>>>>>>>>>>>> # check-cluster is returning nothing >>>>>>>>>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>>> >>>>>>>>>>>>> # ovsdb-server-sb.log below shows the leader is elected with >>>>>>>>>>>>> only one server and there are rbac related debug logs with rpc >>>>>>>>>>>>> replies and >>>>>>>>>>>>> empty params with no errors >>>>>>>>>>>>> >>>>>>>>>>>>> 2018-03-13T01:12:02Z|00002|raft|DBG|server 63d1 added to >>>>>>>>>>>>> configuration >>>>>>>>>>>>> 2018-03-13T01:12:02Z|00003|raft|INFO|term 6: starting election >>>>>>>>>>>>> 2018-03-13T01:12:02Z|00004|raft|INFO|term 6: elected leader >>>>>>>>>>>>> by 1+ of 1 servers >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Now Starting the ovsdb-server on the other clusters fails >>>>>>>>>>>>> saying >>>>>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: >>>>>>>>>>>>> cannot identify file type >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Also noticed that man ovsdb-tool is missing cluster details. >>>>>>>>>>>>> Might want to address it in the same patch or different. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Please advise to what is missing here for running ovn-sbctl >>>>>>>>>>>>> show as this command hangs. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I think you can use the ovn-ctl command >>>>>>>>>>>> "start_cluster_sb_ovsdb" for your testing (atleast for now) >>>>>>>>>>>> >>>>>>>>>>>> For your setup, I think you can start the cluster as >>>>>>>>>>>> >>>>>>>>>>>> # Node 1 >>>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642 >>>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" >>>>>>>>>>>> start_cluster_sb_ovsdb >>>>>>>>>>>> >>>>>>>>>>>> # Node 2 >>>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>>>> start_cluster_sb_ovsdb >>>>>>>>>>>> >>>>>>>>>>>> # Node 3 >>>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.101 --db-sb-port=6642 >>>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.101:6644" >>>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_c >>>>>>>>>>>> luster_sb_ovsdb >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Let me know how it goes. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Numan >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> discuss mailing list >>>>>>>>>>>>> disc...@openvswitch.org >>>>>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> discuss mailing list >>> disc...@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss >>> >>> >> >
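To summarize the bring-up that ended up working in the thread above (addresses are the ones from Numan's example; note that the RAFT port, 6644 here, is separate from the 6642 client port, and stale db files under /etc/openvswitch must be removed before the first start):

# Node 1 bootstraps the SB cluster
sudo /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=192.168.121.91 \
    --db-sb-port=6642 --db-sb-create-insecure-remote=yes \
    --db-sb-cluster-local-addr=tcp:192.168.121.91:6644 start_sb_ovsdb

# Nodes 2 and 3 join through node 1 (shown for node 2; node 3 is analogous)
sudo /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=192.168.121.87 \
    --db-sb-port=6642 --db-sb-create-insecure-remote=yes \
    --db-sb-cluster-local-addr=tcp:192.168.121.87:6644 \
    --db-sb-cluster-remote-addr=tcp:192.168.121.91:6644 start_sb_ovsdb

The NB database is brought up the same way with the corresponding --db-nb-* options, port 6641 for clients, and start_nb_ovsdb.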
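If the cluster is instead created by hand with ovsdb-tool, a sketch based on the commands quoted above (again using a dedicated RAFT port rather than the 6642 client port; the schema path is whatever your tree provides):

# Node 1: create the cluster
ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db \
    ovn/ovn-sb.ovsschema tcp:10.99.152.148:6644

# Nodes 2 and 3: join it, giving the local RAFT address first and then the
# address of at least one existing member (shown for node 2):
ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound \
    tcp:10.99.152.138:6644 tcp:10.99.152.148:6644

ovsdb-server is then started against the resulting db file as in the commands above.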
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss