[ https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dominic Hamon updated MESOS-1517:
---------------------------------
    Labels: reliability twitter  (was: reliability)

> Maintain a queue of messages that arrive before the master recovers.
> --------------------------------------------------------------------
>
>                 Key: MESOS-1517
>                 URL: https://issues.apache.org/jira/browse/MESOS-1517
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Benjamin Mahler
>              Labels: reliability, twitter
>
> Currently when the master is recovering, we drop all incoming messages. If
> slaves and frameworks knew about the leading master only once it has
> recovered, then we would only expect to see messages after we've recovered.
> We previously considered enqueuing all messages through the recovery future,
> but this has the downside of forcing all messages to go through the master's
> queue twice:
> {code}
> // TODO(bmahler): Consider instead re-enqueing *all* messages
> // through recover(). What are the performance implications of
> // the additional queueing delay and the accumulated backlog
> // of messages post-recovery?
> if (!recovered.get().isReady()) {
>   VLOG(1) << "Dropping '" << event.message->name << "' message since "
>           << "not recovered yet";
>   ++metrics.dropped_messages;
>   return;
> }
> {code}
> However, an easy solution to this problem is to maintain an explicit queue of
> incoming messages that gets flushed once we finish recovery. This ensures
> that all messages post-recovery are processed normally.
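
For illustration only, the explicit-queue idea in the description could look
roughly like the sketch below. This is not Mesos code: Message, RecoveryQueue,
receive(), and onRecovered() are hypothetical names, and the real master would
hook this into its own event handling and recovery future. Messages that arrive
before recovery are held instead of dropped, then flushed in arrival order once
recovery completes; later messages go straight to the handler, so they pass
through only one queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an incoming master message (not the Mesos type).
struct Message {
  std::string name;
};

// Hypothetical sketch: buffers messages until recovery completes.
class RecoveryQueue {
public:
  explicit RecoveryQueue(std::function<void(const Message&)> handler)
    : handler_(std::move(handler)) {}

  // Called for every incoming message.
  void receive(Message message) {
    if (!recovered_) {
      pending_.push(std::move(message));  // Hold the message instead of dropping it.
      return;
    }
    handler_(message);  // After recovery, messages are processed directly.
  }

  // Called once when recovery finishes; flushes held messages in arrival order.
  void onRecovered() {
    assert(!recovered_ && "onRecovered() is expected to be called once");
    recovered_ = true;
    while (!pending_.empty()) {
      Message message = std::move(pending_.front());
      pending_.pop();
      handler_(message);
    }
  }

  // Number of messages currently held back.
  std::size_t pending() const { return pending_.size(); }

private:
  std::function<void(const Message&)> handler_;
  std::queue<Message> pending_;
  bool recovered_ = false;
};
```

Note that this sketch leaves the queue unbounded. A real implementation would
probably need a size cap or some other policy for a long recovery, which the
issue does not specify.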