[jira] [Commented] (AURORA-1952) race condition in offers by agent id map (and potentially others) caused(probably) a crash

Jordan Ly (JIRA) Thu, 19 Oct 2017 14:08:28 -0700

    [ 
https://issues.apache.org/jira/browse/AURORA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211737#comment-16211737
 ]


Jordan Ly commented on AURORA-1952:
-----------------------------------

Hi [~mohitjaggi]!

It is possible for Mesos to send multiple offers for the same host in busy 
scenarios. We acknowledge this in `OfferManagerImpl::addOffer`:

{noformat}
      // We run a slight risk of a race here, which is acceptable.  The worst 
case is that we
      // temporarily hold two offers for the same host, which should be 
corrected when we return
      // them after the return delay.
{noformat}

One scenario that can occur is that we hold two offers, but a rescind is issued 
for one of them before we can compact them using return delay.

1. Aurora gets offer_1 for host_a
2. Aurora gets offer_2 for host_a
3. Aurora gets a rescind for offer_1
4. We remove offer_1 from HostOffers, but we also remove the AgentID and HostID 
from their respective maps.
5. Aurora gets offer_3 for host_a and it seems like we do not have any offers 
from that agent, despite the fact that we still have offer_2. We add the offer 
and thus have 2 offers for the same agent.

One thing we can do is ensure that `HostOffers::remove` takes into account the 
value to remove as well:

{noformat}
          offersBySlave.remove(removed.getOffer().getAgentId()); // Add `, 
removed` to args
          offersByHost.remove(removed.getOffer().getHostname());  // Add `, 
removed` to args
{noformat}

The race condition will still exist (but with a smaller probability). A longer 
term solution would be to make `addOffer` synchronous in the sense that it will 
decline the two offers for the same host no matter what. We are actually 
running into this problem here: https://reviews.apache.org/r/63157/ so your 
issue may be solved as a side-effect of this.

> race condition in offers by agent id map (and potentially others) 
> caused(probably) a crash
> ------------------------------------------------------------------------------------------
>
>                 Key: AURORA-1952
>                 URL: https://issues.apache.org/jira/browse/AURORA-1952
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.18.0
>         Environment: nothing special
>            Reporter: Mohit Jaggi
>            Assignee: Mohit Jaggi
>             Fix For: 0.18.0
>
>
> Crashed here
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
> due to duplicates in map. Most likely a concurrency issue. [~wfarner] pointed 
> out the following code:
> I'm looking at this chunk here, where a concurrent map would not help.
>       Optional<HostOffer> sameSlave = 
> hostOffers.get(offer.getOffer().getAgentId());
>       if (sameSlave.isPresent()) {
>         // If there are existing offers for the slave, decline all of them so 
> the master can
>         // compact all of those offers into a single offer and send them back.
>         LOG.info("Returning offers for " + 
> offer.getOffer().getAgentId().getValue()
>             + " for compaction.");
>         decline(offer.getOffer().getId());
>         removeAndDecline(sameSlave.get().getOffer().getId());
>       } else {
>         hostOffers.add(offer);
> --------- logs ----------
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00 PM 
> com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service 
> PreemptorService [FAILED] has failed in the RUNNING state.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: 
> java.lang.IllegalArgumentException: Multiple entries with same key: 
> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value: 
> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
> ...
> ...
> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , 
> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal, 
> attributes=[IAttribute{name=host, values=[compute606-dca1]}, 
> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]}, 
> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE, 
> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple values 
> under a key, use Multimaps.index.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:146)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:161)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.lang.Thread.run(Thread.java:748)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316 
> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55] Service: 
> PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - - 
> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200 95
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77] 
> Executing 4 shutdown commands.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (AURORA-1952) race condition in offers by agent id map (and potentially others) caused(probably) a crash

Reply via email to