[ 
https://issues.apache.org/jira/browse/MESOS-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756937#comment-13756937
 ] 

Benjamin Mahler commented on MESOS-676:
---------------------------------------

Looking closer at this, the slave did not finish recovery and so did not 
attempt to re-register.

However, during recovery the slave received a re-registered message intended 
for its previous run. This happened because the slave was restarting 
frequently.

If we'd like to fix this, a simple approach is to ignore any re-registered 
messages received while recovering: the recovering slave has not yet attempted 
to re-register, so such a message cannot be intended for it.
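
A minimal sketch of that approach follows. It is a standalone model, not the 
actual Mesos code: SlaveModel, State, Outcome and the demo function are 
hypothetical names introduced for illustration, and the model returns a 
WRONG_ID outcome where the real handler calls EXIT(1). The only point is that 
the RECOVERING case logs and returns instead of falling through to LOG(FATAL).

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical stand-in for the slave's state enum (not the Mesos definition).
enum class State { RECOVERING, DISCONNECTED, RUNNING, TERMINATING };

// Hypothetical result type so callers can inspect what the handler did.
enum class Outcome { IGNORED, REGISTERED, ALREADY_REGISTERED, WRONG_ID };

struct SlaveModel {
  State state = State::RECOVERING;
  std::string id;

  Outcome reregistered(const std::string& slaveId) {
    switch (state) {
      case State::RECOVERING:
        // This run has not sent a re-register request yet, so the message
        // must belong to a previous run. Drop it instead of aborting.
        std::cerr << "Ignoring re-registered message: still recovering\n";
        return Outcome::IGNORED;
      case State::TERMINATING:
        std::cerr << "Ignoring re-registration: slave is terminating\n";
        return Outcome::IGNORED;
      case State::DISCONNECTED:
        if (id != slaveId) return Outcome::WRONG_ID;  // real code exits here
        state = State::RUNNING;
        return Outcome::REGISTERED;
      case State::RUNNING:
        if (id != slaveId) return Outcome::WRONG_ID;  // real code exits here
        return Outcome::ALREADY_REGISTERED;
    }
    return Outcome::IGNORED;  // not reached; keeps compilers quiet
  }
};

// Usage: a stale message arrives during recovery and is dropped.
inline void demoIgnoreWhileRecovering() {
  SlaveModel slave;
  slave.id = "S1";
  assert(slave.reregistered("S1") == Outcome::IGNORED);
  assert(slave.state == State::RECOVERING);  // state is left untouched
}
```

The recovering slave keeps its state, so it can still re-register normally 
once recovery completes.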

A longer term solution here is to embed UUIDs or other identifying information 
so we can tie requests to responses: 
https://issues.apache.org/jira/browse/MESOS-677
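
One way to picture the longer-term idea, as a sketch under stated assumptions 
rather than the design tracked in MESOS-677: each slave run carries a token, 
the request includes it, the master echoes it back, and the slave discards any 
response whose token does not match its own. All names below (SlaveRun, 
ReregisterRequest, ReregisteredResponse, echo, freshToken) are hypothetical, 
and a random 64-bit value stands in for a real UUID.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical message shapes; the real protobufs carry no such field here.
struct ReregisterRequest { std::uint64_t token; };
struct ReregisteredResponse { std::uint64_t token; };

// Random 64-bit token used as a stand-in for a UUID.
inline std::uint64_t freshToken() {
  std::random_device rd;
  std::mt19937_64 gen(rd());
  return gen();
}

// One instance per slave process run.
class SlaveRun {
 public:
  explicit SlaveRun(std::uint64_t token) : token_(token) {}

  // Every request carries this run's token.
  ReregisterRequest request() const { return ReregisterRequest{token_}; }

  // A response is accepted only if it echoes this run's token.
  bool accepts(const ReregisteredResponse& response) const {
    return response.token == token_;
  }

 private:
  std::uint64_t token_;
};

// Stand-in for the master: echo the request's token in the response.
inline ReregisteredResponse echo(const ReregisterRequest& request) {
  return ReregisteredResponse{request.token};
}

// Usage: a response addressed to the previous run is rejected by the new run.
inline void demoStaleResponse() {
  SlaveRun previousRun(1);  // fixed tokens keep the demo deterministic;
  SlaveRun currentRun(2);   // a real run would use freshToken()
  ReregisteredResponse stale = echo(previousRun.request());
  assert(!currentRun.accepts(stale));
  assert(currentRun.accepts(echo(currentRun.request())));
}
```

With this check the slave no longer needs to infer from its state whether a 
message is stale; the token mismatch identifies it directly.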
                
> Slave::reregistered LOG(FATAL)s due to being in RECOVERING state.
> -----------------------------------------------------------------
>
>                 Key: MESOS-676
>                 URL: https://issues.apache.org/jira/browse/MESOS-676
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.14.0
>
>
> void Slave::reregistered(const SlaveID& slaveId)
> {
>   switch(state) {
>     case DISCONNECTED:
>       LOG(INFO) << "Re-registered with master " << master;
>       state = RUNNING;
>       if (!(info.id() == slaveId)) {
>         EXIT(1) << "Re-registered but got wrong id: " << slaveId
>                 << "(expected: " << info.id() << "). Committing suicide";
>       }
>       break;
>     case RUNNING:
>       // Already re-registered!
>       if (!(info.id() == slaveId)) {
>         EXIT(1) << "Re-registered but got wrong id: " << slaveId
>                 << "(expected: " << info.id() << "). Committing suicide";
>       }
>       LOG(WARNING) << "Already re-registered with master " << master;
>       break;
>     case TERMINATING:
>       LOG(WARNING) << "Ignoring re-registration because slave is terminating";
>       break;
>     case RECOVERING:
>     default:
>       LOG(FATAL) << "Unexpected slave state " << state;
>       break;
>   }
> }
>
> Saw a slave fail because of this last case statement:
> F0903 02:01:26.436521 42417 slave.cpp:672] Unexpected slave state 0
> *** Check failure stack trace: ***
>     @     0x7f042c579d8d  google::LogMessage::Fail()
>     @     0x7f042c57dd77  google::LogMessage::SendToLog()
>     @     0x7f042c57c674  google::LogMessage::Flush()
>     @     0x7f042c57c8a6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f042c21db8a  mesos::internal::slave::Slave::reregistered()
>     @     0x7f042c276c1d  ProtobufProcess<>::handler1<>()
>     @     0x7f042c24560a  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f042c27702b  ProtobufProcess<>::visit()
>     @     0x7f042c46baf4  process::ProcessManager::resume()
>     @     0x7f042c46c54f  process::schedule()
>     @     0x7f042bbd983d  start_thread
>     @     0x7f042a5bbf8d  clone
> /usr/local/bin/mesos-slave.sh: line 117: 42408 Aborted                 (core 
> dumped) /usr/local/sbin/mesos-slave --port=5051 
> --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" 
> --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
> Slave Exit Status: 134
