analogue created MESOS-5427:
-------------------------------
Summary: Mesos master locks up after slave fails to authenticate
Key: MESOS-5427
URL: https://issues.apache.org/jira/browse/MESOS-5427
Project: Mesos
Issue Type: Bug
Components: master
Affects Versions: 0.20.1
Environment: Linux XXXXXX-XXXXXXXXX 3.13.0-49-generic #81-Ubuntu SMP
Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
Ubuntu 10.04.1 LTS
AWS/8cores/16GB
Reporter: analogue
In a mesos master cluster with one leader and two backups, a single slave
attempting to authenticate with the leader locked up the master and resulted in
2 CPU cores pegged at 100% CPU usage until restarted.
master
{noformat}
I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating
slave(1)@10.85.20.76:5051
I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL
connection
I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
...
W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication request
from slave(1)@10.85.20.76:5051 because authentication is still in progress
(last long line repeats until mesos-master restarted)
{noformat}
slave
{noformat}
Log file created at: 2016/05/16 02:37:52
Running on machine: 10-85-20-76-uswest2btestopia
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
I0516 02:37:52.112783 10198 main.cpp:135] Git SHA:
fe0a39112f3304283f970f1b08b322b1e970829d
I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation:
cgroups/cpu,cgroups/mem
I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer
as the freezer hierarchy for the Linux launcher
I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 1)@10.85.20.76:5051
I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for
authentication from '/etc/seagull_mesos_creds'
W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential file
'/etc/seagull_mesos_creds' are too open. It is recommended that your credential
file is NOT accessible by others.
I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for:
seagull_slave
I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31;
disk(*):140000; mem(*):59382; ports(*):[31000-32000]
I0516 02:37:52.132064 10255 group.cpp:313] Group process
(group(1)@10.85.20.76:5051) connected to ZooKeeper
I0516 02:37:52.132086 10255 group.cpp:787] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0516 02:37:52.132097 10255 group.cpp:385] Trying to create path
'/mesos-releng' in ZooKeeper
I0516 02:37:52.200781 10246 slave.cpp:306] Slave hostname:
10-85-20-76-uswest2btestopia.dev.yelpcorp.com
I0516 02:37:52.200804 10246 slave.cpp:307] Slave checkpoint: true
I0516 02:37:52.201323 10262 detector.cpp:138] Detected a new leader: (id='4733')
I0516 02:37:52.201484 10258 group.cpp:658] Trying to get
'/mesos-releng/info_0000004733' in ZooKeeper
I0516 02:37:52.202786 10262 detector.cpp:426] A new leading master
([email protected]:5050) is detected
I0516 02:37:52.203071 10257 state.cpp:33] Recovering state from
'/ephemeral/mesos-slave/meta'
I0516 02:37:52.203229 10253 status_update_manager.cpp:193] Recovering status
update manager
I0516 02:37:52.203358 10247 containerizer.cpp:252] Recovering containerizer
I0516 02:37:52.205095 10253 slave.cpp:3198] Finished recovery
I0516 02:37:52.206739 10264 slave.cpp:589] New master detected at
[email protected]:5050
I0516 02:37:52.207399 10264 slave.cpp:663] Authenticating with master
[email protected]:5050
I0516 02:37:52.207417 10253 status_update_manager.cpp:167] New master detected
at [email protected]:5050
I0516 02:37:52.207474 10264 slave.cpp:636] Detecting new master
I0516 02:37:52.207484 10256 authenticatee.hpp:104] Initializing client SASL
I0516 02:37:53.521589 10256 authenticatee.hpp:128] Creating new client SASL
connection
W0516 02:37:57.210635 10257 slave.cpp:737] Authentication timed out
W0516 02:37:57.210777 10257 slave.cpp:701] Failed to authenticate with master
{noformat}
No idea why the timestamps don't match up between the master and the slave but
ntp logs to syslog indicate there was no time skew at play here.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)