Hi all, I have an issue with ovsdb in cluster mode when an instance of a db server fails.
I'm running an HA single-stack IPv6 ovn-kubernetes Kind cluster, where ovnnb_db and ovnsb_db are replicated across three nodes, and all control traffic is IPv6. To simulate a node failure, I take one node, delete its db files, and then delete the pod that hosts the db server. The pod and the db files are recreated, but "ovs-appctl cluster/status OVN_Northbound" still shows the *old* server instance alongside the new one. Indeed, the ovsdb-server-nb debug logs on the affected node show that it keeps receiving heartbeat messages for both the new server (to which it correctly replies) and the old one, for which it raises an error:

  syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to 0227 but we're bcda)

In an HA single-stack IPv4 cluster, on the other hand, everything works as expected: 1) for a few tens of seconds, the cluster/status command above shows both the old and the new server, as in the IPv6 case; 2) then the old server is removed and the new one is correctly added to the cluster. This is confirmed in the ovsdb-server-nb logs, where I see the remove_server_request and remove_server_reply messages.

In the HA IPv6 cluster, however, I keep seeing 4 servers and no remove_server_* messages in the logs, so it stays stuck at step 1 above.

Is this a bug? Is there anything I can do to debug this further?

Thanks!
Riccardo
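P.S. For reference, here is roughly how I simulate the failure. The pod, container, and file names below are from my setup and will likely differ in yours:

  # Wipe the NB db file on one member, then kill its pod:
  kubectl -n ovn-kubernetes exec ovnkube-db-0 -c nb-ovsdb -- \
      rm -f /etc/ovn/ovnnb_db.db
  kubectl -n ovn-kubernetes delete pod ovnkube-db-0

  # After the pod is recreated, check the cluster membership:
  kubectl -n ovn-kubernetes exec ovnkube-db-0 -c nb-ovsdb -- \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound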

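P.P.S. In case it helps to reproduce: this is roughly how I collect the debug information above (socket and db paths are again from my environment). I bump the raft module to debug and compare each member's own raft server ID with the entries reported by cluster/status:

  # Enable raft debug logging on the affected member:
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set raft:file:dbg

  # Print this member's own raft server ID, to tell the new instance
  # (bcda... in my logs) apart from the stale one (0227...):
  ovsdb-tool db-sid /etc/ovn/ovnnb_db.db

  # I assume the stale entry could also be removed by hand, e.g.:
  #   ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 0227
  # but the real question is why the automatic removal never happens over IPv6.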