[
https://issues.apache.org/jira/browse/MESOS-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241325#comment-15241325
]
Stefano commented on MESOS-5207:
--------------------------------
Hi all.
Today I tried to set up two Mesos clusters, with one master on one network
joining the group of the other two masters on another network. Let me explain:
I'm working on OpenStack, where I have built some virtual machines and two
different networks.
I have set up two Mesos clusters:
NetworkA:
2 Mesos masters
2 Mesos slaves
NetworkB:
1 Mesos master
1 Mesos slave
I am trying to build an interconnection between these two clusters.
I have set the ZooKeeper configuration so that all 3 masters compete for
the leadership. These are the main configurations:
NetworkA, on both masters:
/etc/zookeeper/conf/zoo.cfg, at the end of the file:
server.1=192.168.100.54:2888:3888 (master1 on network A)
server.2=192.168.100.55:2888:3888 (master2 on network A)
server.3=131.154.xxx.xxx:2888:3888 (master3 on network B, reached through its
floating IP)
/etc/mesos/zk:
zk://192.168.100.54:2181,192.168.100.55:2181,131.154.xxx.xxx:2181/mesos
NetworkB:
/etc/zookeeper/conf/zoo.cfg, at the end of the file:
server.1=131.154.96.27:2888:3888 (master1 on network A, reached through its floating IP)
server.2=131.154.96.32:2888:3888 (master2 on network A, reached through its floating IP)
server.3=192.168.10.11:2888:3888 (master3 on network B)
/etc/mesos/zk:
zk://131.154.zzz.zzz:2181,131.154.yyy.yyy:2181,192.168.10.11:2181/mesos
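For completeness, the full zoo.cfg on a network A master looks roughly like
this; only the server.N lines above are from my real configuration, the rest
(dataDir, clientPort, timing values) are the usual ZooKeeper package defaults
and are just an assumption here:
/etc/zookeeper/conf/zoo.cfg (sketch for the two network A masters):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=192.168.100.54:2888:3888
server.2=192.168.100.55:2888:3888
server.3=131.154.xxx.xxx:2888:3888
Each node also has its id in /var/lib/zookeeper/myid (1 on master1, 2 on
master2, 3 on master3), matching its own server.N entry.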
I notice this problem:
First of all, the 3 masters behave as if they were in the same cluster, so if
I shut one of them down, with quorum 2, a re-election takes place.
The problem is that after a while, roughly 1 minute, the current leader
disconnects and another master takes over the leadership.
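For reference, the masters are installed from packages (that is why the
ZooKeeper URL lives in /etc/mesos/zk), so the master flags come from files
under /etc/mesos-master. A sketch of what master3 on network B has; the
quorum value and the hostname match the flags in the log below, while the ip
file is my assumption and the bind address may simply come from the default
interface:
/etc/mesos/zk              -> zk://131.154.zzz.zzz:2181,131.154.yyy.yyy:2181,192.168.10.11:2181/mesos
/etc/mesos-master/quorum   -> 2
/etc/mesos-master/hostname -> 131.154.96.156   (floating IP advertised to the other masters)
/etc/mesos-master/ip       -> 192.168.10.11    (private address the master listens on)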
Here is the log from the master on network B:
Log file created at: 2016/04/14 15:02:18
Running on machine: master3.novalocal
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0414 15:02:18.447484 20410 logging.cpp:188] INFO level logging started!
I0414 15:02:18.447836 20410 main.cpp:230] Build: 2016-03-10 20:32:58 by root
I0414 15:02:18.447854 20410 main.cpp:232] Version: 0.27.2
I0414 15:02:18.447865 20410 main.cpp:235] Git tag: 0.27.2
I0414 15:02:18.447876 20410 main.cpp:239] Git SHA:
3c9ec4a0f34420b7803848af597de00fedefe0e2
I0414 15:02:18.447931 20410 main.cpp:253] Using 'HierarchicalDRF' allocator
I0414 15:02:18.483774 20410 leveldb.cpp:174] Opened db in 35.734219ms
I0414 15:02:18.505858 20410 leveldb.cpp:181] Compacted db in 22.032139ms
I0414 15:02:18.505903 20410 leveldb.cpp:196] Created db iterator in 7982ns
I0414 15:02:18.505930 20410 leveldb.cpp:202] Seeked to beginning of db in 668ns
I0414 15:02:18.505939 20410 leveldb.cpp:271] Iterated through 0 keys in the db
in 470ns
I0414 15:02:18.505988 20410 replica.cpp:779] Replica recovered with log
positions 0 -> 0 with 1 holes and 0 unlearned
I0414 15:02:18.506793 20410 main.cpp:464] Starting Mesos master
I0414 15:02:18.507874 20410 master.cpp:374] Master
de75d47e-1791-4ab7-ac13-7c927873b035 (131.154.96.156) started on
192.168.10.11:5050
I0414 15:02:18.507890 20410 master.cpp:376] Flags at startup:
--allocation_interval="1secs" --allocator="HierarchicalDRF"
--authenticate="false" --authenticate_http="false"
--authenticate_slaves="false" --authenticators="crammd5" --authorizers="local"
--framework_sorter="drf" --help="false" --hostname="131.154.96.156"
--hostname_lookup="true" --http_authenticators="basic"
--initialize_driver_logging="true" --log_auto_initialize="true"
--log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000"
--max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2"
--recovery_slave_removal_limit="100%" --registry="replicated_log"
--registry_fetch_timeout="1mins" --registry_store_timeout="5secs"
--registry_strict="false" --root_submissions="true"
--slave_ping_timeout="15secs" --slave_reregister_timeout="10mins"
--user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui"
--work_dir="/var/lib/mesos"
--zk="zk://131.154.96.27:2181,131.154.96.32:2181,192.168.10.11:2181/mesos"
--zk_session_timeout="10secs"
I0414 15:02:18.508060 20410 master.cpp:423] Master allowing unauthenticated
frameworks to register
I0414 15:02:18.508070 20410 master.cpp:428] Master allowing unauthenticated
slaves to register
I0414 15:02:18.508097 20410 master.cpp:466] Using default 'crammd5'
authenticator
W0414 15:02:18.508111 20410 authenticator.cpp:511] No credentials provided,
authentication requests will be refused
I0414 15:02:18.508291 20410 authenticator.cpp:518] Initializing server SASL
I0414 15:02:18.509346 20426 log.cpp:236] Attempting to join replica to
ZooKeeper group
I0414 15:02:18.510659 20430 recover.cpp:447] Starting replica recovery
I0414 15:02:18.517371 20431 recover.cpp:473] Replica is in EMPTY status
I0414 15:02:18.518949 20429 master.cpp:1649] Successfully attached file
'/var/log/mesos/mesos-master.INFO'
I0414 15:02:18.518971 20429 contender.cpp:147] Joining the ZK group
I0414 15:02:18.541162 20429 group.cpp:349] Group process
(group(3)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.541213 20429 group.cpp:831] Syncing group operations: queue size
(joins, cancels, datas) = (1, 0, 0)
I0414 15:02:18.541229 20429 group.cpp:427] Trying to create path '/mesos' in
ZooKeeper
I0414 15:02:18.543774 20425 group.cpp:349] Group process
(group(1)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.543800 20425 group.cpp:831] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0414 15:02:18.543810 20425 group.cpp:427] Trying to create path
'/mesos/log_replicas' in ZooKeeper
I0414 15:02:18.545526 20426 group.cpp:349] Group process
(group(4)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.545588 20426 group.cpp:831] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0414 15:02:18.545627 20426 group.cpp:427] Trying to create path '/mesos' in
ZooKeeper
I0414 15:02:18.551719 20424 group.cpp:349] Group process
(group(2)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.551811 20424 group.cpp:831] Syncing group operations: queue size
(joins, cancels, datas) = (1, 0, 0)
I0414 15:02:18.551832 20424 group.cpp:427] Trying to create path
'/mesos/log_replicas' in ZooKeeper
I0414 15:02:18.553040 20426 detector.cpp:154] Detected a new leader: (id='69')
I0414 15:02:18.553306 20426 group.cpp:700] Trying to get
'/mesos/json.info_0000000069' in ZooKeeper
I0414 15:02:18.553695 20425 network.hpp:413] ZooKeeper group memberships changed
I0414 15:02:18.553833 20425 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000066' in ZooKeeper
I0414 15:02:18.556457 20426 detector.cpp:479] A new leading master
([email protected]:5050) is detected
I0414 15:02:18.556591 20426 master.cpp:1710] The newly elected leader is
[email protected]:5050 with id 32fd076d-e6cc-4fe0-acda-d5565bd98445
I0414 15:02:18.562369 20430 contender.cpp:263] New candidate (id='70') has
entered the contest for leadership
I0414 15:02:18.563021 20425 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.100.54:5050 }
I0414 15:02:18.563916 20425 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (5)@192.168.10.11:5050
I0414 15:02:18.566625 20430 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:02:18.576733 20429 network.hpp:413] ZooKeeper group memberships changed
I0414 15:02:18.576817 20429 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000066' in ZooKeeper
I0414 15:02:18.578048 20429 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:02:18.579957 20429 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.10.11:5050, log-replica(1)@192.168.100.54:5050 }
I0414 15:02:28.518209 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:02:28.518898 20429 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (10)@192.168.10.11:5050
I0414 15:02:28.518987 20429 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:02:38.519379 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:02:38.520006 20429 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (12)@192.168.10.11:5050
I0414 15:02:38.520128 20429 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:02:48.520406 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:02:48.521069 20429 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (14)@192.168.10.11:5050
I0414 15:02:48.521224 20429 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:02:50.335360 20429 http.cpp:501] HTTP GET for /master/state.json from
131.154.5.22:59543 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87
Safari/537.36 OPR/36.0.2130.46'
I0414 15:02:58.521517 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:02:58.522234 20429 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (20)@192.168.10.11:5050
I0414 15:02:58.522333 20429 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:03:08.522389 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:03:08.523116 20424 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (23)@192.168.10.11:5050
I0414 15:03:08.523236 20424 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:03:16.019850 20428 network.hpp:413] ZooKeeper group memberships changed
I0414 15:03:16.020007 20428 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:03:16.024132 20427 detector.cpp:154] Detected a new leader: (id='70')
I0414 15:03:16.024277 20427 group.cpp:700] Trying to get
'/mesos/json.info_0000000070' in ZooKeeper
I0414 15:03:16.024700 20428 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.10.11:5050 }
I0414 15:03:16.029292 20427 detector.cpp:479] A new leading master
([email protected]:5050) is detected
I0414 15:03:16.029399 20427 master.cpp:1710] The newly elected leader is
[email protected]:5050 with id de75d47e-1791-4ab7-ac13-7c927873b035
I0414 15:03:16.029422 20427 master.cpp:1723] Elected as the leading master!
I0414 15:03:16.029444 20427 master.cpp:1468] Recovering from registrar
I0414 15:03:16.029558 20427 registrar.cpp:307] Recovering registrar
I0414 15:03:18.523638 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:03:26.609001 20428 network.hpp:413] ZooKeeper group memberships changed
I0414 15:03:26.609223 20428 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:03:26.611070 20428 group.cpp:700] Trying to get
'/mesos/log_replicas/0000000068' in ZooKeeper
I0414 15:03:26.612923 20428 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.10.11:5050, log-replica(1)@192.168.100.54:5050 }
I0414 15:03:26.613404 20428 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (28)@192.168.10.11:5050
I0414 15:03:26.613497 20428 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:03:28.524957 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:03:28.525674 20428 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (30)@192.168.10.11:5050
I0414 15:03:28.525764 20428 recover.cpp:193] Received a recover response from a
replica in EMPTY status
I0414 15:03:38.525599 20432 recover.cpp:109] Unable to finish the recover
protocol in 10secs, retrying
I0414 15:03:38.526219 20428 replica.cpp:673] Replica in EMPTY status received a
broadcasted recover request from (32)@192.168.10.11:5050
I0414 15:03:38.526304 20428 recover.cpp:193] Received a recover response from a
replica in EMPTY status
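One thing I notice in the log above is that the replicated log never completes
recovery ("Unable to finish the recover protocol in 10secs, retrying") and the
log-replica PIDs are advertised with the private addresses
(log-replica(1)@192.168.100.54:5050, log-replica(1)@192.168.10.11:5050). To
rule out a simple reachability problem I would check, from each master, that
the ZooKeeper and Mesos ports of the other masters answer on the addresses
that are actually advertised; for example, from master3 on network B (nc here
is plain netcat, and the addresses to test have to be adapted per master):
nc -zv -w 3 192.168.100.54 5050    # master / log-replica port as advertised in the log
nc -zv -w 3 192.168.100.54 2181    # ZooKeeper client port
nc -zv -w 3 192.168.100.54 2888    # ZooKeeper quorum port
nc -zv -w 3 192.168.100.54 3888    # ZooKeeper leader-election port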
I know this is an unusual use of Mesos clusters, but it is exactly the aim of
my thesis.
Thanks to all and best regards.
Stefano
> Mesos Masters Leader Keeps Fluctuating
> --------------------------------------
>
> Key: MESOS-5207
> URL: https://issues.apache.org/jira/browse/MESOS-5207
> Project: Mesos
> Issue Type: Bug
> Reporter: haosdent
> Assignee: haosdent
>
> Report from user mailing list. [Mesos mail # user Re: Mesos Masters Leader
> Keeps Fluctuating|http://search-hadoop.com/m/0Vlr69BZgz1NlAPP1]
> From suruchi:
> {quote}
> Hi,
>
> I have set the quorum value as 2 as I have configured 3 master machines in my
> environment.
>
> But I don’t know why my leader master keeps fluctuating.
> {quote}
> From Stefano Bianchi:
> {quote}
> I joined this discussion.
> I'm currently setting up a cluster again, but since I don't have many
> resources I need to set up 2 masters.
> In this case, is the quorum value set to 2 correct?
> The problem I notice is that when I connect my 2 Mesos masters, the leader
> disconnects after a few seconds: Failed to connect to...
> Then the other master becomes the leader, but after a while the Failed to
> connect to... message appears again.
> I notice that I had always used Mesos 0.27 and this problem happens with
> Mesos 0.28.
> ...
> However, in the previous configuration the switch between the two masters
> was fine; it was just that after the master had been leading for roughly 30
> seconds, that Failed to connect message appeared.
> {quote}