[ 
https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008
 ] 

Andrei Sekretenko commented on MESOS-9763:
------------------------------------------

In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo 
against the current one was moved into the `_subscribe()` continuation (which 
also applies the update). This fixes the race.

No deterministic test for this race has been implemented yet, though.
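
To illustrate the idea (not the actual patch), here is a minimal Python sketch 
of validating inside the continuation. `FixedMaster`, `subscribe()`, 
`run_authorized()` and the use of a bare principal string in place of a full 
FrameworkInfo are all hypothetical simplifications; only the `_subscribe()` 
name is borrowed from the master code.

```python
# Toy model, NOT Mesos code: validation happens in the same step as the update.

class FixedMaster:
    def __init__(self):
        self.frameworks = {}  # framework id -> principal known to the master
        self.pending = []     # continuations waiting for authorization

    def subscribe(self, fid, principal):
        # Authorization is asynchronous, so only queue the continuation here.
        result = {}
        self.pending.append(lambda: self._subscribe(fid, principal, result))
        return result

    def _subscribe(self, fid, principal, result):
        # Validate against the state as it is *now*, then apply the update.
        stored = self.frameworks.get(fid)
        if stored is not None and stored != principal:
            result["error"] = "principal mismatch"
            return
        self.frameworks[fid] = principal
        result["ok"] = True

    def run_authorized(self):
        # Pretend every pending authorization has just completed.
        for continuation in self.pending:
            continuation()
        self.pending.clear()


master = FixedMaster()
result_a = master.subscribe("fw", "alice")
result_b = master.subscribe("fw", "bob")  # conflicting re-subscription
master.run_authorized()
print(result_a, result_b)
```

In this sketch the second call is rejected because its check sees the state 
left by the first one, instead of the empty state both calls saw on arrival.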

> Race between two re-subscriptions against an empty master.
> ----------------------------------------------------------
>
>                 Key: MESOS-9763
>                 URL: https://issues.apache.org/jira/browse/MESOS-9763
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, scheduler api
>            Reporter: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>
> Currently, subscription (and re-subscription) is not atomic.
>  It consists of three steps performed by two actors:
>   - Validating the supplied FrameworkInfo against the master state (which 
> possibly includes an existing FrameworkInfo)
>   - Authorizing the (re-)subscribing framework
>   - Applying the update
> A partitioned or buggy (or both) framework can trigger a race by sending two 
> SUBSCRIBE calls with differing FrameworkInfos after a master failover.
> One of the possible sequences of events:
>  1. FrameworkInfo A is validated by master (which has no data about this 
> framework)
>  2. Conflicting FrameworkInfo B is validated by master (which stores no data 
> about this framework, as Scheduler A is not even authorized yet)
>  3. Scheduler A is authorized
>  4. Scheduler B is authorized
>  5. FrameworkInfo A is applied
>  6. Master attempts to apply FrameworkInfo B, which is no longer valid after 
> the previous step.
> One simple example is an attempt to re-subscribe with two different 
> principals: currently Scheduler B's principal is silently ignored at 
> step 6 (instead of a validation error being sent to B).
> At the time of writing, I'm not sure whether there are other problems caused 
> by this race.
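
The six-step sequence above can be replayed with a small Python sketch. This 
is a toy model under stated assumptions, not Mesos code: `RacyMaster`, 
`validate()` and `apply()` are hypothetical names, a FrameworkInfo is reduced 
to its principal, and authorization is assumed to always succeed.

```python
# Toy model, NOT Mesos code: validation and update are separate steps.

class RacyMaster:
    def __init__(self):
        self.frameworks = {}  # framework id -> principal known to the master
        self.log = []

    def validate(self, fid, principal):
        # Compare the new info against whatever the master stores right now.
        stored = self.frameworks.get(fid)
        return stored is None or stored == principal

    def apply(self, fid, principal):
        # No re-validation here: a conflicting principal is silently dropped.
        if fid in self.frameworks:
            self.log.append(f"ignored principal {principal!r}")
        else:
            self.frameworks[fid] = principal
            self.log.append(f"stored principal {principal!r}")


master = RacyMaster()                         # failed-over master, no data
valid_a = master.validate("fw", "alice")      # 1. A is validated
valid_b = master.validate("fw", "bob")        # 2. B is validated
# 3. and 4.: authorization of A and B completes (assumed to succeed)
master.apply("fw", "alice")                   # 5. A is applied
master.apply("fw", "bob")                     # 6. B is applied, though stale
print(valid_a, valid_b, master.frameworks, master.log)
```

Both validations pass because neither sees the other's update, and the second 
`apply()` never reports the conflict back to its caller.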



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
