[ https://issues.apache.org/jira/browse/MESOS-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

haosdent updated MESOS-4795:
----------------------------
    Description: 
Here's the sequence of events that happened:

-The agent was running fine with 0.24.1.
-Transient ZK issues occurred, and the slave flapped with zookeeper_init failures.
-The ZK issue was resolved.
-Most agents stopped flapping and functioned correctly.
-Some agents continued flapping, exiting silently after printing the detector.cpp:481 log line.
-The agents that continued to flap were repaired by manually removing the contents of mesos-slave's working dir.
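
For reference, the manual repair in the last step could be scripted roughly as below. This is only a sketch: the helper name `clear_directory` is hypothetical, and the working dir path is an assumption (the log further down recovers state from '/mnt/data/mesos/meta', which suggests a working dir of /mnt/data/mesos). It assumes the mesos-slave process is stopped first. Since checkpointing is enabled on this agent, emptying the working dir presumably discards the checkpointed state as well.

```python
import shutil
from pathlib import Path


def clear_directory(work_dir: str) -> int:
    """Remove every entry inside work_dir but keep work_dir itself.

    Returns the number of top-level entries removed.
    """
    root = Path(work_dir)
    removed = 0
    # Snapshot the listing first so deletions do not disturb iteration.
    for entry in list(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            # Real directory: remove it recursively.
            shutil.rmtree(entry)
        else:
            # Regular file or symlink (including a symlink to a directory).
            entry.unlink()
        removed += 1
    return removed
```

Usage would be something like `clear_directory("/mnt/data/mesos")`, with the path adjusted to whatever working dir the agent was actually started with.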

Here are the contents of the various log files on the agent:

The .INFO log file for one of the restarts, ending just before the mesos-slave 
process exited with no other error messages:
{code}
Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
{code}
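
When comparing these files across many agents, it can help to split each line on the prefix described by the "Log line format" header above. A minimal sketch, assuming Python; the names `GLOG_RE`, `GlogLine` and `parse_glog_line` are made up for illustration and are not part of Mesos or glog:

```python
import re
from typing import NamedTuple, Optional

# Prefix layout taken from the log header:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


class GlogLine(NamedTuple):
    level: str   # one of I, W, E, F
    month: int
    day: int
    time: str    # hh:mm:ss.uuuuuu, kept as text
    thread: int
    file: str
    line: int
    msg: str


def parse_glog_line(text: str) -> Optional[GlogLine]:
    """Parse one log line; return None for headers or continuation lines."""
    m = GLOG_RE.match(text)
    if m is None:
        return None
    return GlogLine(
        m["level"], int(m["month"]), int(m["day"]), m["time"],
        int(m["thread"]), m["file"], int(m["line"]), m["msg"],
    )
```

Feeding the last line of the .INFO file through `parse_glog_line` gives `file == "detector.cpp"` and `line == 481`, which makes it easy to check which agents stopped logging at that same point.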

The .FATAL log file when the original transient ZK error occurred:
{code}
Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
{code}

The .ERROR log file:
{code}
Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
{code}
The .WARNING file had the same content. 

> mesos agent not recovering after ZK init failure
> ------------------------------------------------
>
>                 Key: MESOS-4795
>                 URL: https://issues.apache.org/jira/browse/MESOS-4795
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.24.1
>            Reporter: Sharma Podila



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
