[ 
https://issues.apache.org/jira/browse/MESOS-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921588#comment-13921588
 ] 

Benjamin Hindman commented on MESOS-1058:
-----------------------------------------

Waiting for the registrar to fix the second issue SGTM.

> Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed: 
> !slaves.contains(slaveId)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1058
>                 URL: https://issues.apache.org/jira/browse/MESOS-1058
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>    Affects Versions: 0.17.0, 0.16.0, 0.15.0, 0.18.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> We've observed this CHECK failure in production when the following situation 
> occurs:
> 1. Slave asks to Register with Master.
> 2. Master adds slave with ID 1 and sends acknowledgement.
> 3. Acknowledgement to the slave is dropped due to one-way partition.
> 4. Slave continues to retry.
> 5. Master detects spurious socket closure on slave, marks slave as 
> disconnected.
> 6. Slave did not exit, re-detects Master, and asks to Register.
> 7. Master::registerSlave decides to remove "old disconnected slave".
> BUG: Master::removeSlave does not remove the old slave from the allocator!
> 8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
> 9. Slave receives ID 1 acknowledgement, and checkpoints.
> 10. Slave receives ID 2 acknowledgement, and exits from mismatch.
> 11. Slave recovers and attempts to re-register with checkpointed ID 1.
> 12. Master allows this (no Registrar yet), and attempts to add the slave to 
> the allocator (because of BUG above, CHECK fails in the allocator).
>
> The first bug here is that the Master does not remove a slave from the 
> allocator in Master::removeSlave if the slave is disconnected! This was 
> likely a regression when Allocator::slaveDisconnected was introduced, and we 
> neglected to make the necessary update to Master::removeSlave. This is an 
> easy fix.
>
> The second bug is that the Slave's ID was inconsistent with the Master, and 
> the slave exited, only to re-register with the inconsistent ID. If the above 
> bug is fixed, this means we'll allow the slave to re-register in the Master 
> after having told frameworks the slave is lost. I'm tempted to punt on this 
> bug since with the Registrar, this situation would be prevented as the 
> re-registration would be denied. Also, we already expose this edge-case slave 
> inconsistency to frameworks in other situations without the Registrar.
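
The first bug in the quoted description can be sketched in a few lines of C++. This is a minimal model under stated assumptions, not Mesos code: `Allocator`, `Master`, the `fixed` flag, and `replay` are hypothetical simplifications, and `slaveAdded` returns false at the point where the real allocator would fail its CHECK.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the allocator's slave bookkeeping.
struct Allocator {
  std::set<std::string> slaves;

  // Returns false where the real allocator would hit
  // CHECK(!slaves.contains(slaveId)).
  bool slaveAdded(const std::string& id) { return slaves.insert(id).second; }

  void slaveRemoved(const std::string& id) { slaves.erase(id); }
};

// Hypothetical, heavily simplified master.
struct Master {
  Allocator allocator;
  std::set<std::string> disconnected;
  bool fixed;  // false models the bug, true models the proposed fix.

  explicit Master(bool fixed_) : fixed(fixed_) {}

  bool addSlave(const std::string& id) {
    disconnected.erase(id);
    return allocator.slaveAdded(id);
  }

  void disconnectSlave(const std::string& id) { disconnected.insert(id); }

  void removeSlave(const std::string& id) {
    const bool wasDisconnected = disconnected.erase(id) > 0;
    // Buggy behavior: a disconnected slave is never removed from the
    // allocator. The fix removes it regardless of connection state.
    if (fixed || !wasDisconnected) {
      allocator.slaveRemoved(id);
    }
  }
};

// Replays the master-side steps 2, 5, 7, 8 and 12 from the description.
// Returns whether the re-registration in step 12 can be added to the
// allocator.
bool replay(bool fixed) {
  Master master(fixed);

  const bool first = master.addSlave("1");   // Step 2: register, ID 1.
  assert(first);
  (void) first;

  master.disconnectSlave("1");               // Step 5: socket closure.
  master.removeSlave("1");                   // Step 7: remove old slave.

  const bool second = master.addSlave("2");  // Step 8: register, ID 2.
  assert(second);
  (void) second;

  return master.addSlave("1");               // Step 12: re-register, ID 1.
}
```

Under these assumptions, `replay(false)` returns false (the duplicate add that would trip the CHECK in step 12), while `replay(true)` returns true because the allocator entry for ID 1 is dropped in step 7.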



--
This message was sent by Atlassian JIRA
(v6.2#6252)
