Bankim Bhavsar created KUDU-3312:
------------------------------------

             Summary: SetPermanentUuidForRemotePeer() isn't resilient to DNS 
resolution failure
                 Key: KUDU-3312
                 URL: https://issues.apache.org/jira/browse/KUDU-3312
             Project: Kudu
          Issue Type: Improvement
          Components: consensus, master
            Reporter: Bankim Bhavsar


When bringing up a new Kudu cluster with multiple masters, the masters must be 
brought up together and should all start within a short time window of 30 
seconds (FLAGS_raft_get_node_instance_timeout_ms).

However, when bringing up multiple masters on Kubernetes, we noticed that the 
bring-up sometimes fails because the masters aren't started within that short 
time window. Simply setting FLAGS_raft_get_node_instance_timeout_ms to a higher 
value didn't help in some cases, as DNS resolution would fail in 
SetPermanentUuidForRemotePeer() at the very beginning.

{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager: 
Network error: Unable to initialize catalog manager: Failed to initialize sys 
tables async: Failed to create new distributed Raft config: Unable to 
resolve UUID for peer member_type: VOTER last_known_addr { host: 
"kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port: 
7051 }: unable to resolve address for 
kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or 
service not known
{code}

So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS 
resolution failure, in addition to retrying the RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
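
A minimal sketch of the retry behaviour proposed above, not Kudu's actual 
code. Result, RetryUntilDeadline and all of their parameters are hypothetical 
names used for illustration only; Kudu itself uses kudu::Status and its own 
backoff helpers. The point is that the whole "resolve address, create proxy, 
send RPC" sequence sits inside the retried callable, so a DNS failure is 
retried under the same deadline as an RPC failure.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical result type for this sketch (Kudu itself uses kudu::Status).
struct Result {
  bool ok;
  std::string message;
};

// Calls `attempt` until it succeeds or `timeout` would be exceeded, sleeping
// with capped exponential backoff between tries. `attempt` stands in for the
// full "resolve DNS, build proxy, send RPC" sequence, so every kind of
// failure in that sequence is retried the same way.
Result RetryUntilDeadline(const std::function<Result()>& attempt,
                          std::chrono::milliseconds timeout,
                          std::chrono::milliseconds initial_backoff,
                          std::chrono::milliseconds max_backoff,
                          int* attempts_made = nullptr) {
  assert(initial_backoff.count() > 0);
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + timeout;
  std::chrono::milliseconds backoff = initial_backoff;
  int attempts = 0;
  Result last{false, "not attempted"};
  while (true) {
    last = attempt();
    ++attempts;
    if (attempts_made != nullptr) *attempts_made = attempts;
    if (last.ok) return last;
    // Give up if the next sleep would take us to or past the deadline.
    if (Clock::now() + backoff >= deadline) break;
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, max_backoff);
  }
  return Result{false, "timed out after " + std::to_string(attempts) +
                           " attempts: " + last.message};
}
```

In the scenario described in this issue, `timeout` would correspond to 
FLAGS_raft_get_node_instance_timeout_ms, and the callable would return a 
failed Result whenever address resolution fails (for example with "Name or 
service not known") instead of aborting the whole operation.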
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
