[ 
https://issues.apache.org/jira/browse/MESOS-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-450.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0

Review was here, but lost: https://reviews.apache.org/r/10534

commit a6f722bacb7949199412b1c7a3af09699301142c
Author: Benjamin Mahler <[email protected]>
Date:   Tue Apr 30 01:31:45 2013 +0000

    Removed the hostname:port slave map from the Master.

    Review: https://reviews.apache.org/r/10534

    git-svn-id: https://svn.apache.org/repos/asf/incubator/mesos/trunk@1477443 
13f79535-47bb-0310-9956-ffa450edef68
                
> The master should shut down slaves upon removal.
> ------------------------------------------------
>
>                 Key: MESOS-450
>                 URL: https://issues.apache.org/jira/browse/MESOS-450
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.13.0
>
>
> The following scenario was observed in a production environment.
> 1. The slave terminated due to a machine reboot. The last message from the 
> slave prior to dying:
>     I0419 20:04:47.483470 38030 cgroups_isolation_module.cpp:774] Updated 
> 'memory.soft_limit_in_bytes' to 134217728 for <executor> of framework 
> <framework>
> 2. The master then deactivated the slave due to a health check failure:
>     W0419 20:06:15.952864 28740 master.cpp:1172] Removing slave 
> 201303271650-1944527370-5050-24955-2642 at <slave> because it has been 
> deactivated
>     I0419 20:06:15.956354 28740 master.cpp:1181] Master now considering a 
> slave at <slave> as inactive
> 3. The slave came back up and registered 4 minutes later:
>     I0419 20:10:19.693114  8181 main.cpp:131] Starting Mesos slave
>     I0419 20:10:31.314707  8201 slave.cpp:398] Registered with master; given 
> slave ID 201304011727-2230002186-5050-28738-377
> 4. The master added the slave correctly (the slave had some registration 
> retries):
>     I0419 20:10:22.171288 28742 master.cpp:924] Attempting to register slave 
> on <slave> at slave(1)@<ip>
>     I0419 20:10:22.171336 28742 master.cpp:1159] Master now considering a 
> slave at <slave> as active
>     I0419 20:10:22.171361 28742 master.cpp:1735] Adding slave 
> 201304011727-2230002186-5050-28738-377 at <slave> with cpus=22; mem=21962; 
> ports=[31000-32000]; disk=400000
>     I0419 20:10:22.171661 28739 hierarchical_allocator_process.hpp:393] Added 
> slave 201304011727-2230002186-5050-28738-377 (<slave>) with cpus=22; 
> mem=21962; ports=[31000-32000]; disk=400000 (and cpus=22; mem=21962; 
> ports=[31000-32000]; disk=400000 available)
> ...
>     I0419 20:10:23.171977 28754 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:24.172746 28752 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:25.173225 28747 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:26.174988 28740 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:27.176584 28739 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:28.178421 28754 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:29.179831 28740 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
>     ...
>     I0419 20:10:31.312396 28750 master.cpp:913] Slave 
> 201304011727-2230002186-5050-28738-377 (<slave>) already registered, 
> resending acknowledgement
> 5. Then, ~6 minutes after the original slave went down, the master received 
> an exited event from the slave PID. It's highly likely this was a delayed 
> signal for the old slave:
>     I0419 20:10:31.313608 28750 master.cpp:518] Slave 
> 201304011727-2230002186-5050-28738-377(<slave>) disconnected
> This stems from the fact that we do not know the SlaveID for the exited 
> event, so we incorrectly remove the slave. One fix for this is to tell the 
> slave to shut down when we remove it.
> Without the proposed fix, the slave remained in a state of thinking it is 
> registered, but the master disagrees.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to