[
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam B updated MESOS-6803:
--------------------------
Target Version/s: 1.2.0
> Agent authentication does not have an initial `delay`
> -----------------------------------------------------
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
> Issue Type: Bug
> Components: agent, scheduler driver
> Reporter: Alex Clemmer
> Assignee: Alex Clemmer
> Priority: Critical
> Labels: microsoft, security, windows-mvp
>
> When an agent registers, there is currently a somewhat subtle difference in
> behavior between the cases when it does and does not authenticate:
> * In the case that it DOES NOT authenticate, we will choose a random time
> between 0 and the agent `registration_backoff_factor` to initiate
> registration. The reason for this is to avoid every agent hitting the master
> at once during master failover. (We also employ backoff to help with this.) See:
> [1]
> * In the case that it DOES authenticate, we always attempt to authenticate
> and register the Agent immediately. So currently, in authenticated clusters,
> all agents will try to register with a master immediately after failover,
> though this is helped somewhat by the fact that the authenticated codepath
> still uses backoff. See: [2]
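To make the two bullets above concrete, here is a minimal C++ sketch of a random initial delay followed by capped, randomized backoff on retries. It is not the code at [1] or [2]. The helper names `initialDelay`, `retryDelay`, and `registrationDelays`, the doubling of the bound, and the cap are all assumptions made for illustration; only `registration_backoff_factor` comes from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <random>
#include <vector>

using Seconds = std::chrono::duration<double>;

// Hypothetical helper: a random delay between 0 and `factor`, where `factor`
// stands in for the agent's `registration_backoff_factor`.
Seconds initialDelay(Seconds factor, std::mt19937& rng)
{
  assert(factor.count() >= 0.0);
  std::uniform_real_distribution<double> dist(0.0, factor.count());
  return Seconds(dist(rng));
}

// Hypothetical helper: a random retry delay between 0 and `bound`, never
// larger than `cap` (the cap is an assumption of this sketch).
Seconds retryDelay(Seconds bound, Seconds cap, std::mt19937& rng)
{
  assert(bound.count() >= 0.0);
  std::uniform_real_distribution<double> dist(0.0, bound.count());
  return std::min(Seconds(dist(rng)), cap);
}

// Returns the delay before the first attempt followed by the delay before
// each of `retries` further attempts. The bound doubles after every attempt;
// the doubling is an assumed policy, not taken from the issue.
std::vector<Seconds> registrationDelays(
    Seconds factor, int retries, Seconds cap, std::mt19937& rng)
{
  std::vector<Seconds> delays;
  delays.push_back(initialDelay(factor, rng));

  Seconds bound = factor * 2;
  for (int i = 0; i < retries; ++i) {
    delays.push_back(retryDelay(bound, cap, rng));
    bound *= 2;
  }

  return delays;
}
```

Under these assumptions, the authenticated path differs only in that its first entry would be zero: every agent starts at the same instant after a failover, and only the later retries are spread out.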
> It is important to resolve this disparity, not only to make the system more
> resilient, but also because it directly blocks us from passing many tests on
> platforms where authentication is not supported at all (Windows in
> particular).
> For some time, we have meant to make both the authenticated and
> unauthenticated codepaths use a random `delay` to begin. See Adam's TODO in
> [3]. Historically, people seem to have had a few problems with this:
> 1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end
> up trying to authenticate twice, if a new master is detected before the auth
> is processed. It seems to me that this should not be an issue (or at least,
> not any more).
> 2. Many of our tests depend on authenticated registration happening even if
> `Clock::pause()` has been called; that is, because our first attempts at
> authentication and Agent registration are dispatched for immediate execution,
> these events still happen even when we pause the clock. If we use a
> `delay`, then they are scheduled to happen in the future, and any tests
> employing `Clock::pause` during this time will fail.
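The second problem can be illustrated with a toy model of a paused clock. This is only a sketch: `FakeClock` and its `dispatch`, `delay`, and `advance` members are hypothetical stand-ins written for this example, not the libprocess API, and the clock here behaves as if it were always paused (time moves only when `advance` is called).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <utility>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for a paused test clock; not the libprocess API.
class FakeClock
{
public:
  // Runs `f` right away, like an event dispatched for immediate execution.
  void dispatch(const std::function<void()>& f) { f(); }

  // Schedules `f` to run once the clock has advanced by at least `d`.
  void delay(Millis d, std::function<void()> f)
  {
    timers.emplace(now + d, std::move(f));
  }

  // Moves time forward by `d` and fires every timer that is now due,
  // earliest first.
  void advance(Millis d)
  {
    now += d;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  Millis now{0};
  std::multimap<Millis, std::function<void()>> timers;
};
```

In this model a callback passed to `dispatch` runs even though time never moves, while a callback passed to `delay` runs only after a matching `advance`. A test that pauses the clock and then waits for registration would presumably need an analogous explicit advance once the first attempt is scheduled with a `delay`.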
> The resolution of this bug, at minimum, involves fixing the semantics of the
> above tests so that they pass when `HAS_AUTHENTICATION` is set to false.
> Following this, it is realistic to expect that we add `delay` to the
> authentication codepath as well.
> In terms of resolution, it is useful to know the specific tests that will
> fail if `HAS_AUTHENTICATION` is set to false:
> ```
> [ FAILED ] ExamplesTest.V1JavaFramework
> [ FAILED ] ExamplesTest.PythonFramework
> [ FAILED ] FaultToleranceTest.FrameworkReregister
> [ FAILED ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter> >
> [ FAILED ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, (mesos::internal::tests::ModuleID)6>
> [ FAILED ] MasterTest.EndpointsForHalfRemovedSlave
> [ FAILED ] MasterTest.UnreachableTaskAfterFailover
> [ FAILED ] MasterTest.CancelRecoveredSlaveRemoval
> [ FAILED ] MasterTest.RecoveredFramework
> [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
> [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
> [ FAILED ] OversubscriptionTest.Reregistration
> [ FAILED ] PartitionTest.ReregisterSlavePartitionAware
> [ FAILED ] PartitionTest.ReregisterSlaveNotPartitionAware
> [ FAILED ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
> [ FAILED ] PartitionTest.PartitionedSlaveOrphanedTask
> [ FAILED ] PartitionTest.SpuriousSlaveReregistration
> [ FAILED ] PartitionTest.PartitionedSlaveStatusUpdates
> [ FAILED ] PartitionTest.RegistryGcByCount
> [ FAILED ] PartitionTest.RegistryGcByAge
> [ FAILED ] PartitionTest.RegistryGcRace
> [ FAILED ] OneWayPartitionTest.MasterToSlave
> [ FAILED ] ReconciliationTest.ReconcileStatusUpdateTaskState
> [ FAILED ] ReservationTest.ACLMultipleOperations
> [ FAILED ] ReservationTest.WithoutAuthenticationWithoutPrincipal
> [ FAILED ] ReservationTest.WithoutAuthenticationWithPrincipal
> [ FAILED ] SlaveTest.DuplicateTerminalUpdateBeforeAck
> [ FAILED ] SlaveTest.StateEndpoint
> [ FAILED ] SlaveTest.PingTimeoutNoPings
> [ FAILED ] SlaveTest.PingTimeoutSomePings
> [ FAILED ] SlaveTest.ReregisterWithStatusUpdateTaskState
> [ FAILED ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
> [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where GetParam() = application/x-protobuf
> [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where GetParam() = application/json
> [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/0, where GetParam() = application/x-protobuf
> [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/1, where GetParam() = application/json
> [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where GetParam() = application/x-protobuf
> [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where GetParam() = application/json
> [ FAILED ] DiskResource/PersistentVolumeTest.MasterFailover/0, where GetParam() = 0
> [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 0
> [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, where GetParam() = 1
> [ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 0
> [ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, where GetParam() = 1
> [ FAILED ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 2
> [ FAILED ] MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 2
> ```
> [1] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948
> [2] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942
> [3] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938
> [4] https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562