[ 
https://issues.apache.org/jira/browse/AURORA-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136097#comment-16136097
 ] 

Jordan Ly commented on AURORA-1945:
-----------------------------------

https://reviews.apache.org/r/61804/

> Rescinds received but not processed in time before offer accept
> ---------------------------------------------------------------
>
>                 Key: AURORA-1945
>                 URL: https://issues.apache.org/jira/browse/AURORA-1945
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Jordan Ly
>            Assignee: Jordan Ly
>            Priority: Minor
>
> The following race condition for offers is currently possible:
> # Scheduler receives an offer and adds it to the executor queue for 
> processing.
> # The executor processes the offer and adds it to the HostOffers list.
> # Scheduler receives a rescind for that offer and adds it to the executor 
> queue for processing. However, if the executor is under heavy load, there 
> can be a delay between receiving the rescind and processing it.
> # Scheduler accepts the offer before the executor has processed the rescind. 
> This launches a task with an invalid offer, which results in TASK_LOST.
> The following logs show this in action:
> Mesos:
> {noformat}
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with 
> revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X 
> since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers 
> '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST 
> for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> {noformat}
> Aurora:
> {noformat}
> I0810 14:28:45.676 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] 
> Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for 
> task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: 
> Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to 
> cancel offer: OFFER_X. 
> {noformat}
> We should find a way to prioritize rescinds, or process them immediately, to 
> avoid this delay. We should also take into account the previous race 
> condition fixed by 
> [AURORA-1933|https://issues.apache.org/jira/browse/AURORA-1933] so that we 
> do not reintroduce it.
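
For illustration only, here is a minimal Python sketch of the four steps in the
description, plus one possible mitigation: recording the rescind synchronously
on the callback thread so that the accept path can check it before the queued
cleanup has run. This is not the change in the review linked above, and every
name in it (OfferTracker, on_offer, on_rescind, process_one, try_accept,
replay) is hypothetical rather than Aurora's API. A deque stands in for the
loaded executor queue so the interleaving is deterministic.

```python
from collections import deque


class OfferTracker:
    """Hypothetical model of a scheduler with a single-threaded work queue."""

    def __init__(self, rescind_immediately):
        self.rescind_immediately = rescind_immediately
        self.queue = deque()      # stands in for the loaded executor queue
        self.host_offers = set()  # offers the scheduler believes are usable
        self.rescinded = set()    # rescinds seen but not yet cleaned up

    def on_offer(self, offer_id):
        # Offers are always queued for asynchronous processing.
        self.queue.append(("add", offer_id))

    def on_rescind(self, offer_id):
        if self.rescind_immediately:
            # Record the rescind right away, before any queueing delay.
            self.rescinded.add(offer_id)
        self.queue.append(("remove", offer_id))

    def process_one(self):
        # Run the next queued work item, as the executor eventually would.
        action, offer_id = self.queue.popleft()
        if action == "add":
            # Skip offers whose rescind has already been recorded.
            if offer_id not in self.rescinded:
                self.host_offers.add(offer_id)
        else:
            self.host_offers.discard(offer_id)
            self.rescinded.discard(offer_id)

    def try_accept(self, offer_id):
        # True means the scheduler would send ACCEPT for this offer.
        if offer_id in self.rescinded or offer_id not in self.host_offers:
            return False
        self.host_offers.discard(offer_id)
        return True


def replay(rescind_immediately):
    tracker = OfferTracker(rescind_immediately)
    tracker.on_offer("OFFER_X")           # step 1: offer received and queued
    tracker.process_one()                 # step 2: offer added to host offers
    tracker.on_rescind("OFFER_X")         # step 3: rescind queued, not processed
    return tracker.try_accept("OFFER_X")  # step 4: accept attempted


print(replay(rescind_immediately=False))  # True: the stale offer is accepted
print(replay(rescind_immediately=True))   # False: the accept is refused
```

In this sketch an offer whose rescind was recorded before its queued add ran is
also dropped. Whether that ordering corresponds to the race fixed in
AURORA-1933 is an assumption and should be checked against that issue.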



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)