[ https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008 ]
Andrei Sekretenko commented on MESOS-9763:
------------------------------------------

In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo against the current one was moved into the `_subscribe()` continuation (which also applies the update). This fixes the race.

No deterministic test for this race has been implemented yet, though.

> Race between two re-subscriptions against an empty master.
> ----------------------------------------------------------
>
>                 Key: MESOS-9763
>                 URL: https://issues.apache.org/jira/browse/MESOS-9763
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, scheduler api
>            Reporter: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>
> Currently, subscription (and re-subscription) is not atomic.
> It consists of three steps performed by two actors:
> - Validating the supplied FrameworkInfo against the master state (which
>   possibly includes an existing FrameworkInfo)
> - Authorizing the (re-)subscribing framework
> - Applying the update
>
> A partitioned or buggy (or both) framework can trigger a race by sending two
> SUBSCRIBE calls with differing FrameworkInfos on master failover.
>
> One of the possible sequences of events:
> 1. FrameworkInfo A is validated by the master (which has no data about this
>    framework)
> 2. A conflicting FrameworkInfo B is validated by the master (which still
>    stores no data about this framework, as Scheduler A is not even
>    authorized yet)
> 3. Scheduler A is authorized
> 4. Scheduler B is authorized
> 5. FrameworkInfo A is applied
> 6. The master attempts to apply FrameworkInfo B, which is no longer valid
>    after the previous step.
>
> One simple example is an attempt to re-subscribe with two different
> principals: currently Scheduler B's principal is silently ignored at
> step 6 (instead of a validation error being sent to B).
>
> At the time of writing I'm not sure whether this race causes other problems.
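
The interleaving from the issue description can be replayed in a toy model. The sketch below is illustrative Python, not Mesos code: `ToyMaster`, `FrameworkInfo` (reduced to an ID and a principal), `validate_on_apply`, and `run_race` are all hypothetical names, authorization is modeled as a no-op, and "silently ignored" is approximated by keeping the FrameworkInfo that was stored first. It only shows why re-running validation in the same step that applies the update lets the second subscriber see an error.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FrameworkInfo:
    # Hypothetical, heavily reduced stand-in for the real FrameworkInfo.
    framework_id: str
    principal: str


class ToyMaster:
    """Toy model of the master state; an assumption, not the Mesos master."""

    def __init__(self, validate_on_apply: bool) -> None:
        self.validate_on_apply = validate_on_apply
        self.frameworks: Dict[str, FrameworkInfo] = {}

    def validate(self, info: FrameworkInfo) -> Optional[str]:
        """Return an error message if `info` conflicts with stored state."""
        current = self.frameworks.get(info.framework_id)
        if current is not None and current.principal != info.principal:
            return "principal differs from the one already subscribed"
        return None

    def apply(self, info: FrameworkInfo) -> Optional[str]:
        """Last step of a (re-)subscription, run after authorization."""
        if self.validate_on_apply:
            # Validate and apply in one step, so the check always sees
            # the state left behind by any earlier subscription.
            error = self.validate(info)
            if error is not None:
                return error
            self.frameworks[info.framework_id] = info
            return None
        # Racy variant: trust the earlier validation. A conflicting
        # principal is dropped without reporting anything to the caller.
        self.frameworks.setdefault(info.framework_id, info)
        return None


def run_race(validate_on_apply: bool) -> Tuple[List[Optional[str]], str]:
    """Replay steps 1-6 from the issue against an empty toy master."""
    master = ToyMaster(validate_on_apply)
    info_a = FrameworkInfo("fw-1", "principal-a")
    info_b = FrameworkInfo("fw-1", "principal-b")

    # Steps 1-2: both infos pass validation, because the master is empty.
    early_errors = [master.validate(info_a), master.validate(info_b)]
    if any(error is not None for error in early_errors):
        raise RuntimeError("early validation unexpectedly failed")

    # Steps 3-4: authorization of A and B (a no-op in this model).

    # Steps 5-6: A is applied, then B.
    apply_errors = [master.apply(info_a), master.apply(info_b)]
    return apply_errors, master.frameworks["fw-1"].principal


if __name__ == "__main__":
    print(run_race(validate_on_apply=False))
    print(run_race(validate_on_apply=True))
```

With `validate_on_apply=False` both applications report success while the stored principal stays `principal-a`, so B never learns that its FrameworkInfo was rejected. With `validate_on_apply=True` the second application returns a validation error instead.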