[ 
https://issues.apache.org/jira/browse/MESOS-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4546:
-------------------------------------
    Assignee: Neil Conway  (was: Artem Harutyunyan)

> Mesos Agents need to re-resolve hosts in zk string on leader change / 
> failure to connect
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-4546
>                 URL: https://issues.apache.org/jira/browse/MESOS-4546
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Cody Maloney
>            Assignee: Neil Conway
>            Priority: Blocker
>              Labels: mesosphere
>
> Sample Mesos Agent log: 
> https://gist.github.com/brndnmtthws/fb846fa988487250a809
> Note: ZooKeeper has a function to change the list of servers at runtime: 
> https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232
> This comes up when using an AWS AutoScalingGroup for managing the set of 
> masters. 
> When the agent first comes up, it resolves the hosts in the zk:// string. Once 
> all the hosts in the original string have failed (each one fails and is replaced 
> by a new machine with the same DNS name), the agent keeps spinning 
> in an internal loop and never re-resolves the DNS names.
> I see two possible solutions:
> 1. Update the list of servers / re-resolve the DNS names.
> 2. Have the agent detect that it hasn't connected recently and kill itself (which 
> will force a re-resolution when the agent starts back up).
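
The first option above could look roughly like the sketch below. It is an illustration only, not Mesos or ZooKeeper code: `parse_zk_hosts`, `ServerList`, the `max_failures` threshold, and the default port 2181 fallback are all hypothetical names and choices made for this example. The idea is to keep the host names from the zk:// string, count consecutive connection failures, and re-resolve the names through DNS once the count reaches a threshold, so that replacement machines behind the same DNS names are picked up.

```python
import socket


def parse_zk_hosts(zk_url):
    """Split 'zk://h1:2181,h2:2181/path' into [(host, port), ...]."""
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    # Keep only the host list, dropping the znode path.
    hostlist = zk_url[len("zk://"):].split("/", 1)[0]
    # Drop optional 'user:pass@' credentials.
    hostlist = hostlist.rsplit("@", 1)[-1]
    hosts = []
    for entry in hostlist.split(","):
        host, _, port = entry.partition(":")
        # Assumption for this sketch: fall back to 2181 if no port is given.
        hosts.append((host, int(port) if port else 2181))
    return hosts


def resolve(host, port):
    """Return the sorted IP addresses currently behind a DNS name."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class ServerList:
    """Hypothetical helper: re-resolves host names after repeated failures."""

    def __init__(self, zk_url, resolver=resolve, max_failures=3):
        self.hosts = parse_zk_hosts(zk_url)  # names are kept, not just IPs
        self.resolver = resolver
        self.max_failures = max_failures
        self.failures = 0
        self.addresses = []
        self.refresh()

    def refresh(self):
        """Re-resolve every name; return True if the address set changed."""
        fresh = [(ip, port)
                 for host, port in self.hosts
                 for ip in self.resolver(host, port)]
        changed = fresh != self.addresses
        self.addresses = fresh
        self.failures = 0
        return changed

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        """Count a failed connection; re-resolve once the threshold is hit."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.refresh()
```

A caller would invoke `record_failure()` each time a connection attempt fails and reconnect using `addresses` afterwards. The resolver is injectable so the behaviour can be exercised without real DNS lookups.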



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
