Bankim Bhavsar created KUDU-3312:
------------------------------------
Summary: SetPermanentUuidForRemotePeer() isn't resilient to DNS
resolution failure
Key: KUDU-3312
URL: https://issues.apache.org/jira/browse/KUDU-3312
Project: Kudu
Issue Type: Improvement
Components: consensus, master
Reporter: Bankim Bhavsar
When bringing up a new Kudu cluster with multiple masters, the masters must
be brought up together, all starting within a short time window of 30 seconds
(FLAGS_raft_get_node_instance_timeout_ms).
However, when bringing up multiple masters on Kubernetes, we noticed that
bring-up sometimes fails because the masters aren't started within that short
time window. Simply raising FLAGS_raft_get_node_instance_timeout_ms didn't help
in some cases, because DNS resolution would fail in
SetPermanentUuidForRemotePeer() at the very beginning.
{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager:
Network error: Unable to initialize catalog manager: Failed to initialize sys
tables async: Failed to create new distributed Raft config: Unable to
resolve UUID for peer member_type: VOTER last_known_addr { host:
"kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port:
7051 }: unable to resolve address for
kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or
service not known
{code}
So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS
resolution failure, in addition to retrying the RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
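A rough sketch of the kind of retry being proposed, in standalone C++. This is
not Kudu's actual code: Result, RetryUntilDeadline, and the backoff parameters
are hypothetical names made up for illustration, and the attempt callback
stands in for the whole "resolve address, create proxy, send RPC" sequence.
The point is only that a DNS or proxy failure is retried under the same
deadline as the RPC, instead of returning immediately.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for a status type (Kudu itself uses kudu::Status).
struct Result {
  bool ok;
  std::string message;
};

// Calls `attempt` until it succeeds or `timeout` elapses. `attempt` is assumed
// to cover every step that can fail transiently: DNS resolution, proxy
// creation, and the RPC itself. Sleeps with capped exponential backoff between
// tries. `attempt` is always invoked at least once.
Result RetryUntilDeadline(const std::function<Result()>& attempt,
                          std::chrono::milliseconds timeout,
                          std::chrono::milliseconds initial_backoff =
                              std::chrono::milliseconds(10),
                          std::chrono::milliseconds max_backoff =
                              std::chrono::milliseconds(1000)) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + timeout;
  auto backoff = initial_backoff;
  Result last{false, "not attempted"};
  while (true) {
    last = attempt();
    if (last.ok) return last;

    const auto now = Clock::now();
    if (now >= deadline) break;

    // Never sleep past the deadline.
    const auto remaining =
        std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
    std::this_thread::sleep_for(std::min(backoff, remaining));
    backoff = std::min(backoff * 2, max_backoff);
  }
  return Result{false, "timed out; last error: " + last.message};
}
```

With something like this, the timeout passed in would presumably come from
FLAGS_raft_get_node_instance_timeout_ms, so that raising the flag also extends
how long a master waits for its peers' DNS records to appear.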