[
https://issues.apache.org/jira/browse/KUDU-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bankim Bhavsar updated KUDU-3312:
---------------------------------
Description:
When bringing up a new Kudu cluster with multiple masters, the masters must be
brought up together and should all start within a short time window of 30
seconds (FLAGS_raft_get_node_instance_timeout_ms).
However, while bringing up multiple masters on Kubernetes, I noticed that the
bring-up sometimes fails because the masters aren't all started within that
window. Simply raising FLAGS_raft_get_node_instance_timeout_ms didn't help in
some cases, because DNS resolution would fail in
SetPermanentUuidForRemotePeer() at the very beginning:
{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager:
Network error: Unable to initialize catalog manager: Failed to initialize sys
tables async: Failed to create new distributed Raft config: Unable to
resolve UUID for peer member_type: VOTER last_known_addr { host:
"kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port:
7051 }: unable to resolve address for
kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or
service not known
{code}
So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS
resolution failures, in addition to retrying the RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
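One possible shape for such a retry, as a minimal standalone sketch rather than
the actual Kudu code: the address resolution/proxy creation step and the RPC
step are wrapped in a single attempt, so a DNS failure is retried under the
same deadline as an RPC failure. The Status struct, RetryUntilDeadline(),
ResolveUuidWithRetry() and the backoff values are all hypothetical names and
numbers chosen for illustration; they are not taken from consensus_peers.cc.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for kudu::Status; only what this sketch needs.
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status NetworkError(std::string m) { return {false, std::move(m)}; }
};

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Calls `attempt` until it succeeds or `timeout` has elapsed, sleeping with
// exponential backoff (capped at `max_backoff`) between tries. Returns the
// status of the last try. If `attempts_out` is non-null, it receives the
// number of tries that were made.
Status RetryUntilDeadline(const std::function<Status()>& attempt,
                          Millis timeout, Millis initial_backoff,
                          Millis max_backoff, int* attempts_out = nullptr) {
  const auto deadline = Clock::now() + timeout;
  Millis backoff = initial_backoff;
  int attempts = 0;
  Status s = Status::OK();
  while (true) {
    s = attempt();
    ++attempts;
    if (s.ok) break;
    const auto now = Clock::now();
    if (now >= deadline) break;
    // Never sleep past the deadline.
    const Millis remaining =
        std::chrono::duration_cast<Millis>(deadline - now);
    std::this_thread::sleep_for(std::min(backoff, remaining));
    backoff = std::min(backoff * 2, max_backoff);
  }
  if (attempts_out != nullptr) *attempts_out = attempts;
  return s;
}

// Treats "resolve address / create proxy" and "send RPC" as one attempt, so
// that a failure in either step is retried until the deadline.
Status ResolveUuidWithRetry(
    const std::function<Status()>& resolve_and_create_proxy,
    const std::function<Status()>& send_rpc,
    Millis timeout, int* attempts_out = nullptr) {
  return RetryUntilDeadline(
      [&]() {
        Status s = resolve_and_create_proxy();
        if (!s.ok) return s;  // DNS/proxy failure: retry on the next pass.
        return send_rpc();
      },
      timeout, Millis(10), Millis(1000), attempts_out);
}
```

In this sketch the timeout argument plays the role that
FLAGS_raft_get_node_instance_timeout_ms plays for the RPC today, so raising the
flag would also extend how long a not-yet-resolvable peer hostname is
tolerated.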
> SetPermanentUuidForRemotePeer() isn't resilient to DNS resolution failure
> -------------------------------------------------------------------------
>
> Key: KUDU-3312
> URL: https://issues.apache.org/jira/browse/KUDU-3312
> Project: Kudu
> Issue Type: Improvement
> Components: consensus, master
> Reporter: Bankim Bhavsar
> Priority: Major