Alex Clemmer created MESOS-6803:
-----------------------------------
Summary: Agent authentication does not have an initial `delay`
Key: MESOS-6803
URL: https://issues.apache.org/jira/browse/MESOS-6803
Project: Mesos
Issue Type: Bug
Components: agent
Reporter: Alex Clemmer
Assignee: Alex Clemmer
When an agent registers, there is currently a somewhat subtle difference in
behavior between the cases when it does and does not authenticate:
* In the case that it DOES NOT authenticate, we will choose a random time
between 0 and the agent's `registration_backoff_factor` to initiate
registration. The reason for this is to avoid every agent hitting the master at
once during master failover. (We also employ backoff to help with this.) See: [1]
* In the case that it DOES authenticate, we always attempt to authenticate and
register the Agent immediately. So currently, in authenticated clusters, all
agents will try to register with a master immediately upon failover, though
this is helped somewhat by the fact that the authenticated codepath still uses
backoff. See: [2]
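To make the disparity concrete, here is a minimal sketch of the two behaviors. It assumes the initial delay is drawn uniformly from [0, backoff factor). The names `initialDelay`, `firstAttemptDelay`, and `randomizeAuthenticated` are hypothetical and are not taken from the Mesos source; the real logic is at [1] and [2].

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper (not a Mesos API): picks a uniformly random initial
// delay in [0, backoffFactor), so that agents do not all contact the master
// at the same instant after a failover.
std::chrono::milliseconds initialDelay(
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  assert(backoffFactor.count() > 0);

  std::uniform_int_distribution<std::chrono::milliseconds::rep> dist(
      0, backoffFactor.count() - 1);

  return std::chrono::milliseconds(dist(rng));
}

// Hypothetical helper: the delay applied before the first attempt.
// With `randomizeAuthenticated == false` this models the behavior described
// above, where authenticated agents start immediately. Setting it to true
// models the proposed behavior, where both codepaths use a random delay.
std::chrono::milliseconds firstAttemptDelay(
    bool authenticated,
    bool randomizeAuthenticated,
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  if (authenticated && !randomizeAuthenticated) {
    return std::chrono::milliseconds(0);
  }

  return initialDelay(backoffFactor, rng);
}
```

With a backoff factor of 1000ms, `firstAttemptDelay(true, false, ...)` always yields 0ms, whereas `firstAttemptDelay(true, true, ...)` yields a value spread across the whole interval.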
It is important to resolve this disparity, not only to make the system more
resilient, but also because it directly blocks us from passing many tests on
platforms where authentication is not supported at all (Windows in particular).
For some time, we have meant to make both the authenticated and unauthenticated
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically,
people seem to have had a few problems with this:
1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end
up trying to authenticate twice, if a new master is detected before the auth is
processed. It seems to me that this should not be an issue (or at least, not
any more).
2. Many of our tests depend on authenticated registration happening even if
`Clock::pause()` has been called; that is, because our first attempt at
authentication and Agent registration are dispatched for immediate execution,
even when we pause the clock, these events should still happen. If we use a
`delay`, then they are scheduled to happen in the future, and any tests
employing `Clock::pause` during this time will fail.
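The second problem can be illustrated with a toy clock. This is a sketch under stated assumptions, not libprocess's implementation: `FakeClock` and its methods are hypothetical, and time in it only moves when `advance` is called, which stands in for a paused test clock.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Toy stand-in for a paused test clock (hypothetical, not libprocess).
class FakeClock
{
public:
  // Runs `f` right away, like an event dispatched for immediate execution.
  void dispatch(const std::function<void()>& f) { f(); }

  // Schedules `f` to run once `ms` of simulated time has elapsed.
  void delay(long ms, std::function<void()> f)
  {
    assert(ms >= 0);
    timers.emplace(now + ms, std::move(f));
  }

  // Moves simulated time forward and runs every timer that has come due,
  // in order of expiry.
  void advance(long ms)
  {
    assert(ms >= 0);
    now += ms;

    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  long now = 0;
  std::multimap<long, std::function<void()>> timers;
};
```

A callback passed to `dispatch` has already run by the time the call returns, while one passed to `delay(500, ...)` runs only after `advance` has moved time forward by at least 500. A test that pauses the clock and then waits for registration without advancing it would therefore stall once registration is scheduled through a delay.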
At minimum, resolving this bug involves fixing the semantics of the tests
described above so that they pass when `HAS_AUTHENTICATION` is set to false.
After that, we can realistically expect to add a `delay` to the authenticated
codepath as well.
In terms of resolution, it is useful to know the specific tests that will fail
if `HAS_AUTHENTICATION` is set to false:
```
[ FAILED ] ExamplesTest.V1JavaFramework
[ FAILED ] ExamplesTest.PythonFramework
[ FAILED ] FaultToleranceTest.FrameworkReregister
[ FAILED ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter> >
[ FAILED ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, (mesos::internal::tests::ModuleID)6>
[ FAILED ] MasterTest.EndpointsForHalfRemovedSlave
[ FAILED ] MasterTest.UnreachableTaskAfterFailover
[ FAILED ] MasterTest.CancelRecoveredSlaveRemoval
[ FAILED ] MasterTest.RecoveredFramework
[ FAILED ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[ FAILED ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[ FAILED ] OversubscriptionTest.Reregistration
[ FAILED ] PartitionTest.ReregisterSlavePartitionAware
[ FAILED ] PartitionTest.ReregisterSlaveNotPartitionAware
[ FAILED ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[ FAILED ] PartitionTest.PartitionedSlaveOrphanedTask
[ FAILED ] PartitionTest.SpuriousSlaveReregistration
[ FAILED ] PartitionTest.PartitionedSlaveStatusUpdates
[ FAILED ] PartitionTest.RegistryGcByCount
[ FAILED ] PartitionTest.RegistryGcByAge
[ FAILED ] PartitionTest.RegistryGcRace
[ FAILED ] OneWayPartitionTest.MasterToSlave
[ FAILED ] ReconciliationTest.ReconcileStatusUpdateTaskState
[ FAILED ] ReservationTest.ACLMultipleOperations
[ FAILED ] ReservationTest.WithoutAuthenticationWithoutPrincipal
[ FAILED ] ReservationTest.WithoutAuthenticationWithPrincipal
[ FAILED ] SlaveTest.DuplicateTerminalUpdateBeforeAck
[ FAILED ] SlaveTest.StateEndpoint
[ FAILED ] SlaveTest.PingTimeoutNoPings
[ FAILED ] SlaveTest.PingTimeoutSomePings
[ FAILED ] SlaveTest.ReregisterWithStatusUpdateTaskState
[ FAILED ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
[ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where GetParam() = application/x-protobuf
[ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where GetParam() = application/json
[ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/0, where GetParam() = application/x-protobuf
[ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/1, where GetParam() = application/json
[ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where GetParam() = application/x-protobuf
[ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where GetParam() = application/json
[ FAILED ] DiskResource/PersistentVolumeTest.MasterFailover/0, where GetParam() = 0
[ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 0
[ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, where GetParam() = 1
[ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 0
[ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, where GetParam() = 1
[ FAILED ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 2
[ FAILED ] MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 2
```
[1] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948
[2] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942
[3] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938
[4] https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562