[ 
https://issues.apache.org/jira/browse/MESOS-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756949#comment-13756949
 ] 

Benjamin Mahler commented on MESOS-675:
---------------------------------------

{noformat}
$ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log | grep disconnected
I0903 01:52:54.441226 13322 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 01:53:28.456810 13324 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 01:54:36.494281 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 01:56:45.629204 13313 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 01:57:08.661918 13321 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 01:58:07.367280 13317 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 02:00:55.511580 13323 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 02:01:22.640902 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
W0903 02:01:22.641563 13314 master.cpp:1404] Master returning resources offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
I0903 02:01:22.646209 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
F0903 02:01:22.646548 13314 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) already disconnected!
{noformat}

{noformat}
$ grep Starting /var/log/mesos/old/mesos-slave.log.16
I0903 01:57:09.531086 40675 main.cpp:128] Starting Mesos slave
I0903 01:58:09.686590 41040 main.cpp:128] Starting Mesos slave
I0903 02:01:00.834357 42271 main.cpp:128] Starting Mesos slave
I0903 02:01:21.470432 42408 main.cpp:128] Starting Mesos slave
I0903 02:01:31.957490 42513 main.cpp:128] Starting Mesos slave
I0903 02:05:57.852473 44105 main.cpp:128] Starting Mesos slave
{noformat}

Comparing the disconnects with the restarts, it looks like the master received 
two exited events for the slave restart at 02:01:21: the disconnects logged at 
02:01:22.640902 and 02:01:22.646209, the second of which tripped the CHECK. 
Will have to walk through libprocess to see how this might occur.
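
The comparison above can be reproduced mechanically. The following is a 
minimal Python sketch, not part of Mesos. It assumes the timestamps are copied 
by hand from the two grep outputs, and that a disconnect logged within 6 
seconds of a "Starting Mesos slave" line belongs to that restart. The window 
size and the function names are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Timestamps copied from the master log (grep disconnected), all on 09-03.
DISCONNECTS = [
    "01:52:54.441226", "01:53:28.456810", "01:54:36.494281",
    "01:56:45.629204", "01:57:08.661918", "01:58:07.367280",
    "02:00:55.511580", "02:01:22.640902", "02:01:22.646209",
]

# Timestamps copied from the slave log (grep Starting), all on 09-03.
RESTARTS = [
    "01:57:09.531086", "01:58:09.686590", "02:01:00.834357",
    "02:01:21.470432", "02:01:31.957490", "02:05:57.852473",
]

# Assumed matching window; chosen for illustration only.
WINDOW = timedelta(seconds=6)


def parse(ts):
    """Parse the time-of-day part of a glog line prefix (HH:MM:SS.uuuuuu)."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def disconnects_near(restart, disconnects, window=WINDOW):
    """Return the disconnects logged within `window` of one restart."""
    r = parse(restart)
    return [d for d in disconnects if abs(parse(d) - r) <= window]


def restarts_with_multiple_disconnects(restarts, disconnects, window=WINDOW):
    """Map each restart to its disconnects, keeping only restarts with > 1."""
    result = {}
    for restart in restarts:
        near = disconnects_near(restart, disconnects, window)
        if len(near) > 1:
            result[restart] = near
    return result


if __name__ == "__main__":
    for restart, near in restarts_with_multiple_disconnects(
            RESTARTS, DISCONNECTS).items():
        print(restart, "->", near)
```

The absolute difference is used because the disconnect can be logged on either 
side of the "Starting" line: the two logs come from different hosts, and the 
old slave process exits before the new one starts.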
                
> CHECK failure in the Master.
> ----------------------------
>
>                 Key: MESOS-675
>                 URL: https://issues.apache.org/jira/browse/MESOS-675
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> Observed this failure in a staging cluster running 0.14.0-rc2.
> {noformat}
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>) already disconnected!
> *** Check failure stack trace: ***
>     @     0x7fb470894d8d  google::LogMessage::Fail()
>     @     0x7fb470898d77  google::LogMessage::SendToLog()
>     @     0x7fb470897674  google::LogMessage::Flush()
>     @     0x7fb4708978a6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb4704aaea4  mesos::internal::master::Master::exited()
>     @     0x7fb470786af4  process::ProcessManager::resume()
>     @     0x7fb47078754f  process::schedule()
>     @     0x7fb46fef483d  start_thread
>     @     0x7fb46e8d6f8d  clone
> {noformat}
> Grepping for this slave in the logs:
> {noformat}
> $ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log
> W0902 06:01:10.607168 11876 master.cpp:1317] Ignoring unknown exited executor thermos-1377831261464-mesos-slave-recovery-spinner-60-f0bcfda6-4f8d-4df4-bd74-0b15f32d0502 on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> ...
> W0902 06:01:10.646383 11876 master.cpp:1317] Ignoring unknown exited executor thermos-1377964938274-mesos-slave-recovery-spinner-184-3a25b824-5d73-4be0-984d-606230c5e8ac on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> W0902 06:01:10.699635 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:10.700628 11868 hierarchical_allocator_process.hpp:434] Added slave 201308270033-1937777162-5050-50911-137 (<scrub>) with cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000 (and cpus(*):10.96; mem(*):19866; ports(*):[31000-31003, 31005-31449, 31451-31580, 31582-31801, 31803-31927, 31929-32000]; disk(*):397809 available)
> W0902 06:01:10.866525 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:10.919178 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:11.070862 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:11.085773 11876 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> W0902 06:01:11.086096 11876 master.cpp:1404] Master returning resources offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
> I0902 06:01:11.086145 11867 hierarchical_allocator_process.hpp:459] Removed slave 201308270033-1937777162-5050-50911-137
> I0902 06:01:11.104651 11876 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>) already disconnected!
> {noformat}

