[ 
https://issues.apache.org/jira/browse/MESOS-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1058:
-----------------------------------

    Description: 
We've observed this CHECK failure in production when the following situation 
occurs:

1. Slave asks to Register with Master.
2. Master adds slave with ID 1 and sends acknowledgement.
3. Acknowledgement to the slave is dropped due to one-way partition.
4. Slave continues to retry.
5. Master detects spurious socket closure on slave, marks slave as disconnected.
6. Slave did not exit, re-detects Master, and asks to Register.
7. Master::registerSlave decides to remove "old disconnected slave".
BUG: Master::removeSlave does not remove the old slave from the allocator!
8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
9. Slave receives ID 1 acknowledgement, and checkpoints.
10. Slave receives ID 2 acknowledgement, and exits from mismatch.
11. Slave recovers and attempts to re-register with checkpointed ID 1.
12. Master allows this (no Registrar yet), and attempts to add the slave to the 
allocator (because of BUG above, CHECK fails in the allocator).

The first bug here is that the Master does not remove a slave from the 
allocator in Master::removeSlave if the slave is disconnected! This was likely 
a regression when Allocator::slaveDisconnected was introduced, and we neglected 
to make the necessary update to Master::removeSlave. This is an easy fix.
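
To make the first bug concrete, here is a minimal C++ sketch that replays steps
2, 5, 7, 8 and 12 against a toy model. ToyAllocator, ToyMaster and replay are
hypothetical names for illustration only, not the actual Mesos classes, and the
sketch throws an exception where the real allocator would hit a CHECK and abort.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for the allocator's slave bookkeeping (not the Mesos class).
struct ToyAllocator {
  std::set<std::string> slaves;

  // Mirrors the failing invariant: a slave ID may only be added once.
  // Throws instead of aborting so the failure can be observed.
  void slaveAdded(const std::string& id) {
    if (slaves.count(id) > 0) {
      throw std::logic_error("Check failed: !slaves.contains(slaveId)");
    }
    slaves.insert(id);
  }

  void slaveRemoved(const std::string& id) { slaves.erase(id); }
};

// Toy stand-in for the master's view of its slaves (hypothetical).
struct ToyMaster {
  ToyAllocator allocator;
  std::set<std::string> registered;
  std::set<std::string> disconnected;
  bool removeFromAllocatorWhenDisconnected;  // false reproduces the bug

  explicit ToyMaster(bool fixed) : removeFromAllocatorWhenDisconnected(fixed) {}

  void addSlave(const std::string& id) {
    registered.insert(id);
    allocator.slaveAdded(id);
  }

  void disconnect(const std::string& id) { disconnected.insert(id); }

  void removeSlave(const std::string& id) {
    const bool wasDisconnected = disconnected.erase(id) > 0;
    registered.erase(id);
    // Buggy variant: a disconnected slave is never removed from the allocator.
    if (!wasDisconnected || removeFromAllocatorWhenDisconnected) {
      allocator.slaveRemoved(id);
    }
  }
};

// Replays the scenario; returns true if the final add trips the check.
bool replay(bool fixed) {
  ToyMaster master(fixed);
  master.addSlave("1");     // step 2: slave added with ID 1
  master.disconnect("1");   // step 5: marked as disconnected
  master.removeSlave("1");  // step 7: "old disconnected slave" removed
  master.addSlave("2");     // step 8: slave added with ID 2
  try {
    master.addSlave("1");   // step 12: re-registration with checkpointed ID 1
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

Under these assumptions, replay(false) trips the check at step 12, while
replay(true), which always removes the slave from the allocator, does not.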

The second bug is that the Slave's ID was inconsistent with the Master, and the 
slave exited, only to re-register with the inconsistent ID. If the above bug is 
fixed, this means we'll allow the slave to re-register in the Master after 
having told frameworks the slave is lost. I'm tempted to punt on this bug since 
with the Registrar, this situation would be prevented as the re-registration 
would be denied. Also, we already expose this edge-case slave inconsistency to 
frameworks in other situations without the Registrar.
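
As a rough illustration of why a Registrar would prevent the second bug, the
sketch below assumes the master consults a persistent set of admitted slave IDs
before allowing re-registration. ToyRegistry and its methods are hypothetical;
this message does not describe the actual Registrar design.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical registry of admitted slave IDs (an assumption, not the
// actual Registrar interface).
struct ToyRegistry {
  std::set<std::string> admitted;

  void admit(const std::string& id) { admitted.insert(id); }
  void remove(const std::string& id) { admitted.erase(id); }

  // Re-registration is only allowed for IDs that are still admitted.
  bool mayReregister(const std::string& id) const {
    return admitted.count(id) > 0;
  }
};

// Returns whether the slave may re-register with ID 1 after the master
// has removed ID 1 and admitted the same slave under ID 2.
bool reregisterAfterRemoval() {
  ToyRegistry registry;
  registry.admit("1");   // step 2
  registry.remove("1");  // step 7
  registry.admit("2");   // step 8
  return registry.mayReregister("1");  // step 11
}
```

In this model the re-registration with the checkpointed ID 1 is denied, so
frameworks would never see a slave come back after being told it was lost.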

  was:
We've observed this CHECK failure in production when the following situation 
occurs:

1. Slave asks to Register with Master.
2. Master adds slave with ID 1 and sends acknowledgment.
3. Acknowledgement to the slave is dropped due to one-way partition.
4. Slave continues to retry.
5. Master detects socket closure on slave, marks slave as disconnected.
6. Slave did not exit, re-detects Master, and asks to Register.
7. Master::registerSlave decides to remove "old disconnected slave".
BUG: Master::removeSlave does not remove the old slave from the allocator!
8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
9. Slave receives ID 1 acknowledgement, and checkpoints.
10. Slave receives ID 2 acknowledgement, and exits from mismatch.
11. Slave recovers and attempts to re-register with checkpointed ID 1.
12. Master allows this (no Registrar yet), and attempts to add the slave to the 
allocator (because of BUG above, CHECK fails in the allocator).

The first bug here is that the Master does not remove a slave from the 
allocator in Master::removeSlave if the slave is disconnected! This was likely 
a regression when Allocator::slaveDisconnected was introduced, and we neglected 
to make the necessary update to Master::removeSlave. This is an easy fix.

The second bug is that the Slave's ID was inconsistent with the Master, and the 
slave exited, only to re-register with the inconsistent ID. If the above bug is 
fixed, this means we'll allow the slave to re-register in the Master after 
having told frameworks the slave is lost. I'm tempted to punt on this bug since 
with the Registrar, this situation would be prevented as the re-registration 
would be denied. Also, we already expose this edge-case slave inconsistency to 
frameworks in other situations without the Registrar.


> Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed: 
> !slaves.contains(slaveId)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1058
>                 URL: https://issues.apache.org/jira/browse/MESOS-1058
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>    Affects Versions: 0.17.0, 0.16.0, 0.15.0, 0.18.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> We've observed this CHECK failure in production when the following situation 
> occurs:
>
> 1. Slave asks to Register with Master.
> 2. Master adds slave with ID 1 and sends acknowledgment.
> 3. Acknowledgement to the slave is dropped due to one-way partition.
> 4. Slave continues to retry.
> 5. Master detects spurious socket closure on slave, marks slave as 
> disconnected.
> 6. Slave did not exit, re-detects Master, and asks to Register.
> 7. Master::registerSlave decides to remove "old disconnected slave".
> BUG: Master::removeSlave does not remove the old slave from the allocator!
> 8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
> 9. Slave receives ID 1 acknowledgement, and checkpoints.
> 10. Slave receives ID 2 acknowledgement, and exits from mismatch.
> 11. Slave recovers and attempts to re-register with checkpointed ID 1.
> 12. Master allows this (no Registrar yet), and attempts to add the slave to 
> the allocator (because of BUG above, CHECK fails in the allocator).
>
> The first bug here is that the Master does not remove a slave from the 
> allocator in Master::removeSlave if the slave is disconnected! This was 
> likely a regression when Allocator::slaveDisconnected was introduced, and we 
> neglected to make the necessary update to Master::removeSlave. This is an 
> easy fix.
>
> The second bug is that the Slave's ID was inconsistent with the Master, and 
> the slave exited, only to re-register with the inconsistent ID. If the above 
> bug is fixed, this means we'll allow the slave to re-register in the Master 
> after having told frameworks the slave is lost. I'm tempted to punt on this 
> bug since with the Registrar, this situation would be prevented as the 
> re-registration would be denied. Also, we already expose this edge-case slave 
> inconsistency to frameworks in other situations without the Registrar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
