Jordan Ly created AURORA-1945:
---------------------------------
Summary: Rescinds received but not processed in time before offer
accept
Key: AURORA-1945
URL: https://issues.apache.org/jira/browse/AURORA-1945
Project: Aurora
Issue Type: Bug
Components: Scheduler
Reporter: Jordan Ly
Assignee: Jordan Ly
Priority: Minor
The following race condition for offers is currently possible:
# Scheduler receives an offer and adds it to the executor queue for processing.
# The executor processes the offer and adds it to the HostOffers list.
# Scheduler receives a rescind for that offer and adds it to the executor queue
for processing. However, if the executor is under heavy load, there may be a
delay between receiving the rescind and processing it.
# Scheduler accepts the offer before the rescind is processed by the executor.
This results in launching a task with an invalid offer, which leads to TASK_LOST.
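The four steps above can be modelled in a few lines. This is only an illustrative, single-threaded sketch: the class {{RescindRaceModel}} and all of its methods are hypothetical names, not Aurora classes, and a plain queue stands in for the real executor so the interleaving is deterministic.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the race; none of these names are Aurora classes.
class RescindRaceModel {
  // Stands in for the executor queue that callbacks are enqueued on.
  private final Queue<Runnable> executorQueue = new ArrayDeque<>();
  // Stands in for the HostOffers list.
  private final Set<String> hostOffers = new HashSet<>();

  // Offer callback: work is queued, not applied immediately.
  void onOffer(String offerId) {
    executorQueue.add(() -> hostOffers.add(offerId));
  }

  // Rescind callback: also queued, so it can sit behind other work.
  void onRescind(String offerId) {
    executorQueue.add(() -> hostOffers.remove(offerId));
  }

  // Runs one queued callback, if any, like a busy executor making slow progress.
  void processOne() {
    Runnable next = executorQueue.poll();
    if (next != null) {
      next.run();
    }
  }

  // The accept path only consults hostOffers, so it cannot see a queued rescind.
  boolean tryAccept(String offerId) {
    return hostOffers.remove(offerId);
  }

  // Replays steps 1-4 and reports whether the rescinded offer was accepted.
  static boolean acceptsRescindedOffer() {
    RescindRaceModel model = new RescindRaceModel();
    model.onOffer("OFFER_X");            // step 1: offer queued
    model.processOne();                  // step 2: offer lands in hostOffers
    model.onRescind("OFFER_X");          // step 3: rescind queued, not yet processed
    return model.tryAccept("OFFER_X");   // step 4: accept wins the race
  }
}
```

In this model {{acceptsRescindedOffer()}} returns true. When the queued rescind finally runs, the offer is already gone, which is consistent with the "Failed to cancel offer" warning in the Aurora log.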
The following logs show this in action:
Mesos:
{noformat}
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with
revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X
since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[
OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST
for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
{noformat}
Aurora:
{noformat}
I0810 14:28:45.676 [SchedulerImpl-0,
MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService]
Accepting offer OFFER_X with ops [LAUNCH]
I0810 14:34:24.186 [Thread-4471585,
MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task
TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task
launched with invalid offers: Offer OFFER_X is no longer valid
I0810 14:34:32.972 [SchedulerImpl-0,
MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to
cancel offer: OFFER_X.
{noformat}
We should find a way to prioritize rescinds, or process them immediately, to
avoid this delay. We should also take into account the race condition previously
fixed by [AURORA-1933|https://issues.apache.org/jira/browse/AURORA-1933] so that
we do not reintroduce it.
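One possible direction, sketched here under assumptions rather than proposed as the fix: record the rescinded offer ID synchronously on the callback thread, before any work is queued, and have the accept path check that record. The class {{RescindAwareOffers}} and its methods are hypothetical and do not correspond to Aurora's OfferManager API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not Aurora's OfferManager API.
class RescindAwareOffers {
  // Offers currently believed to be usable.
  private final Set<String> available = ConcurrentHashMap.newKeySet();
  // Offers whose rescind has been received but not yet fully processed.
  private final Set<String> rescinded = ConcurrentHashMap.newKeySet();

  // Called from the executor once the offer has been processed.
  void addOffer(String offerId) {
    if (!rescinded.contains(offerId)) {
      available.add(offerId);
    }
  }

  // Called directly on the callback thread, ahead of any queued work.
  void markRescinded(String offerId) {
    rescinded.add(offerId);
  }

  // Called later from the executor to finish the cleanup.
  void processRescind(String offerId) {
    available.remove(offerId);
    rescinded.remove(offerId);
  }

  // Accept only offers that are available and not marked as rescinded.
  boolean tryAccept(String offerId) {
    return available.remove(offerId) && !rescinded.contains(offerId);
  }
}
```

This narrows the window but does not close it: a rescind that arrives just after {{tryAccept}} returns is still lost, and the sketch assumes an offer is never added after its rescind has been fully processed. It makes no attempt to cover the race fixed by AURORA-1933.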