[ https://issues.apache.org/jira/browse/MESOS-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921588#comment-13921588 ]
Benjamin Hindman commented on MESOS-1058:
-----------------------------------------
Waiting for the registrar to fix the second issue SGTM.
> Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed:
> !slaves.contains(slaveId)
> ----------------------------------------------------------------------------------------------------
>
> Key: MESOS-1058
> URL: https://issues.apache.org/jira/browse/MESOS-1058
> Project: Mesos
> Issue Type: Bug
> Components: master, slave
> Affects Versions: 0.17.0, 0.16.0, 0.15.0, 0.18.0
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Fix For: 0.19.0
>
>
> We've observed this CHECK failure in production when the following situation
> occurs:
> 1. Slave asks to register with the Master.
> 2. Master adds slave with ID 1 and sends acknowledgement.
> 3. Acknowledgement to the slave is dropped due to one-way partition.
> 4. Slave continues to retry.
> 5. Master detects spurious socket closure on slave, marks slave as
> disconnected.
> 6. Slave did not exit, re-detects Master, and asks to register.
> 7. Master::registerSlave decides to remove "old disconnected slave".
> BUG: Master::removeSlave does not remove the old slave from the allocator!
> 8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
> 9. Slave receives ID 1 acknowledgement, and checkpoints.
> 10. Slave receives ID 2 acknowledgement, and exits because of the ID
> mismatch.
> 11. Slave recovers and attempts to re-register with checkpointed ID 1.
> 12. Master allows this (no Registrar yet), and attempts to add the slave to
> the allocator (because of BUG above, CHECK fails in the allocator).
>
> The first bug here is that the Master does not remove a slave from the
> allocator in Master::removeSlave if the slave is disconnected! This was
> likely a regression from when Allocator::slaveDisconnected was introduced:
> we neglected to make the corresponding update to Master::removeSlave. This
> is an easy fix.
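
A minimal sketch of the first bug, as a toy model rather than the actual Mesos code. ToyAllocator, ToyMaster, and replayTripsCheck are hypothetical names introduced only for illustration, and the CHECK is modeled as a thrown exception so the failure can be observed (a real CHECK aborts the process). The sketch replays steps 2, 5, 7, 8, and 12 from the description above, once with the buggy removeSlave and once with a version that always removes the slave from the allocator.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for the allocator's slave bookkeeping (not the real Mesos class).
class ToyAllocator {
public:
  // Models the failing invariant: a slave ID may only be added once.
  void slaveAdded(const std::string& slaveId) {
    if (slaves.count(slaveId) > 0) {
      // Assumption: thrown here so callers can observe it; a real CHECK aborts.
      throw std::logic_error("Check failed: !slaves.contains(slaveId)");
    }
    slaves.insert(slaveId);
  }

  void slaveRemoved(const std::string& slaveId) { slaves.erase(slaveId); }

  bool contains(const std::string& slaveId) const {
    return slaves.count(slaveId) > 0;
  }

private:
  std::set<std::string> slaves;
};

// Toy master that tracks which slaves it considers disconnected.
class ToyMaster {
public:
  explicit ToyMaster(bool applyFix) : fixed(applyFix) {}

  void addSlave(const std::string& slaveId) {
    allocator.slaveAdded(slaveId);
    disconnected.erase(slaveId);
  }

  void disconnectSlave(const std::string& slaveId) {
    disconnected.insert(slaveId);
  }

  void removeSlave(const std::string& slaveId) {
    const bool isDisconnected = disconnected.count(slaveId) > 0;
    // Buggy variant: the allocator is skipped when the slave is disconnected.
    // Fixed variant: the allocator is always told about the removal.
    if (fixed || !isDisconnected) {
      allocator.slaveRemoved(slaveId);
    }
    disconnected.erase(slaveId);
  }

  ToyAllocator allocator;

private:
  bool fixed;
  std::set<std::string> disconnected;
};

// Replays the scenario; returns true if the final re-add trips the check.
bool replayTripsCheck(bool applyFix) {
  ToyMaster master(applyFix);
  master.addSlave("1");         // step 2: slave registered with ID 1
  master.disconnectSlave("1");  // step 5: socket closure, marked disconnected
  master.removeSlave("1");      // step 7: "old disconnected slave" removed
  master.addSlave("2");         // step 8: same slave re-added with ID 2
  assert(master.allocator.contains("2"));
  try {
    master.addSlave("1");       // step 12: re-registration with checkpointed ID 1
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

In this model, replayTripsCheck(false) reproduces the check failure and replayTripsCheck(true) does not. The fixed variant still lets ID 1 back in, which is the second issue described below.
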
>
> The second bug is that the Slave's ID was inconsistent with the Master's,
> and the slave exited, only to re-register with the inconsistent ID. If the
> above bug is fixed, this means we'll allow the slave to re-register with the
> Master after having told frameworks the slave is lost. I'm tempted to punt
> on this bug since, with the Registrar, this situation would be prevented:
> the re-registration would be denied. Also, we already expose this edge-case
> slave inconsistency to frameworks in other situations without the Registrar.
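
A rough sketch of why a registry of known slaves would close the second issue. This is a toy model under stated assumptions, not the Registrar's actual design or API: ToyRegistry and its methods are hypothetical, and the rule "deny re-registration for any ID that has been removed" is only an assumed reading of "the re-registration would be denied" in the description above.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a registry of admitted slave IDs.
class ToyRegistry {
public:
  void admit(const std::string& slaveId) { admitted.insert(slaveId); }

  // Once an ID is removed, frameworks have been told that slave is lost.
  void remove(const std::string& slaveId) { admitted.erase(slaveId); }

  // Assumed policy: only IDs still in the registry may re-register.
  bool allowReregistration(const std::string& slaveId) const {
    return admitted.count(slaveId) > 0;
  }

private:
  std::set<std::string> admitted;
};

// Replays the ID history from the scenario and reports whether the
// checkpointed ID 1 would be allowed back in.
bool reregistrationAllowedAfterRemoval() {
  ToyRegistry registry;
  registry.admit("1");   // step 2: slave admitted with ID 1
  registry.remove("1");  // step 7: old slave removed
  registry.admit("2");   // step 8: slave admitted again with ID 2
  assert(registry.allowReregistration("2"));
  return registry.allowReregistration("1");  // step 11: re-register as ID 1
}
```

Under that assumed policy, the attempt in step 11 is refused instead of reaching the allocator, so frameworks never see a slave return after being told it was lost.
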