Christopher M Luciano created MESOS-5832:
--------------------------------------------
Summary: Mesos replicated log corruption with disconnects from ZK
Key: MESOS-5832
URL: https://issues.apache.org/jira/browse/MESOS-5832
Project: Mesos
Issue Type: Bug
Affects Versions: 0.27.1, 0.25.1
Reporter: Christopher M Luciano
Setup:
I set up 5 mesos and marathon masters (which I'll refer to as m1, m2, m3, m4, m5)
running mesos version 0.27.2 (confirmed to affect 0.25.0 also).
I set up 5 mesos agents (which I'll refer to as a1, a2, a3, a4, a5) running the
same mesos version as the masters.
All of these were pointed at a single zookeeper (NOT an ensemble).
mesos-slave and mesos-master are run by upstart, and both are configured to be
restarted on halting/crashing.
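For reference, the masters and agents were started with flags along these lines
(a sketch rather than my exact upstart jobs - the IP and work_dir are
placeholders, and --quorum=3 is a majority of the 5 masters):
mesos-master --zk=zk://10.xx.xx.xx:2181/mesos --quorum=3 --work_dir=/var/lib/mesos
mesos-slave --master=zk://10.xx.xx.xx:2181/mesos --work_dir=/var/lib/mesos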
Procedure:
1) I confirm a mesos master has been elected and all agents have been discovered
2) On the zookeeper machine, I add an IPTABLES rule which blocks all incoming
traffic from m1 and m2
3) The mesos-master processes on m1 and m2 halt - upstart restarts them. They are
not able to communicate with zookeeper, and therefore are no longer considered
part of the cluster
4) A leader election happens (m3 is elected leader)
5) I shut down the mesos-slave process on a1 (note - I do an initctl stop
mesos-slave - just killing it will cause it to be restarted)
6) I wait to confirm the slave is reported as down by m3 (one way to check this
is sketched after the procedure)
7) I add IPTABLES rules on the zookeeper machine to block all incoming traffic
from m3, m4, and m5
8) I confirm that the mesos-master processes on m3, m4, and m5 have all halted
and restarted
9) I confirm that all masters report themselves as not in the cluster
10) I remove the IPTABLES rule from the zookeeper machine that is blocking all
traffic from m1 and m2
11) m1 and m2 now report they are part of the cluster - there is a leader
election and either m1 or m2 is now elected leader. NOTE: because the cluster
does not have quorum, no agents are listed.
12) I shut down the mesos-slave process on a2
13) In the logs of the current master, I can see this information being
processed by the master.
14) I add IPTABLES rules on the zookeeper machine to block all masters
15) I wait for all masters to report themselves as not being in the cluster
16) I remove all IPTABLES rules on the zookeeper machine
17) All masters join the cluster, and a leader election happens
18) After ten minutes, the leader's mesos-master process will halt, a leader
election will happen...and this repeats every 10 minutes
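For the "confirm" and "wait" steps, one way to check who is leading and which
agents are registered (a sketch - the IP is a placeholder; 5050 and
/master/state.json are the default master port and state endpoint for these
versions, and the "leader" and "slaves" fields in the response show the elected
master and the registered agents):
curl -s http://10.xx.xx.xx:5050/master/state.json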
Summary:
Here is what I think is happening in the above test case: at the end of step 17,
the masters all try to do replicated log reconciliation, and can't. I think the
state of the agents isn't actually relevant - the replicated log reconciliation
causes a hang or a silent failure. After 10 minutes, the leader hits a timeout
for communicating with the registry (i.e. zookeeper) - even though it can
communicate with zookeeper, it never does because of the previous hang/silent
failure.
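One way to see what the leading master is doing during those 10 minutes (a
sketch - the log location is an assumption, glog's mesos-master.INFO under
whatever --log_dir you use, and the exact wording of the recovery messages may
differ):
grep -iE 'registrar|recover|replica' /var/log/mesos/mesos-master.INFO | tail -40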
Attached is a perl script I used on the zookeeper machine to automate the steps
above. If you want to use it, you'll need to change the IPs set in the script,
and make sure that the current mesos leader is one of the first 2 IPs.
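#!/usr/bin/perl
# Run as root on the zookeeper machine - the script manipulates iptables
# directly and ssh'es to the agents to stop mesos-slave.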
sub drop_it{
print "dropping $_[0]\n";
`iptables -I INPUT -s $_[0] -j DROP;`;
}
sub drop_agent{
print "dropping agent $_[0]\n";
print `ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root\@$_[0] "sudo initctl stop mesos-slave"`;
}
sub revive_it{
print "reviviing $_[0]\n";
`iptables -D INPUT -s $_[0] -j DROP;`;
}
$master_1='10.xx.xx.xx';
$master_2='10.xx.xx.xx';
$master_3='10.xx.xx.xx';
$master_4='10.xx.xx.xx';
$master_5='10.xx.xx.xx';
$agent_1='10.xx.xx.xx';
$agent_2='10.xx.xx.xx';
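# step 2: block all traffic from m1 and m2 at the zookeeper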
drop_it($master_1);
drop_it($master_2);
sleep(20);
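# step 5: stop the mesos-slave on a1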
drop_agent($agent_1);
sleep(20);
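# step 7: block m3, m4 and m5 as well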
drop_it($master_3);
drop_it($master_4);
drop_it($master_5);
sleep(20);
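# step 10: unblock m1 and m2 - they rejoin and elect a leader without quorum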
revive_it($master_1);
revive_it($master_2);
sleep(180);
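# step 12: stop the mesos-slave on a2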
drop_agent($agent_2);
sleep(20);
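# step 14: block m1 and m2 again - with m3-m5 still blocked, every master is now cut off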
drop_it($master_1);
drop_it($master_2);
sleep(20);
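# step 16: remove all the blocks - the masters rejoin (step 17) and the leader
# then halts every ~10 minutes (step 18)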
revive_it($master_1);
revive_it($master_2);
revive_it($master_3);
revive_it($master_4);
revive_it($master_5);
The end outcome of the above is a replicated log that can never be recovered. We
keep hitting registrar timeouts and must blow away the replicated log on all the
masters in order for it to be recreated and for the cluster to recover.
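For reference, the "blow away the log" workaround amounts to the following on
every master (a sketch - the path assumes --work_dir=/var/lib/mesos; the
registrar's replicated log normally lives in the replicated_log directory under
the master's work_dir):
initctl stop mesos-master
rm -rf /var/lib/mesos/replicated_log
initctl start mesos-master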
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)