[ 
https://issues.apache.org/jira/browse/AURORA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098318#comment-15098318
 ] 

Bill Farner commented on AURORA-1580:
-------------------------------------

I was able to validate a hypothesis and reproduce this with a test case.  The 
code below uses a reader and writer thread to manipulate a task and verify the 
value returned by the task store.

This is ultimately the result of using {{LOCK_MODE 0}} (read uncommitted) in h2 
\[[1|http://h2database.com/html/advanced.html#transaction_isolation]\].
This mode was enabled in AURORA-1386, however AURORA-1395 should have addressed 
the issue leading to AURORA-1386 in a more appropriate way.
I have not been able to trigger this issue with {{LOCK_MODE 1}} (serializable), 
however i think we really need {{LOCK_MODE 3}} (read committed) for reasonable 
performance.  Unfortunately the test case will occasionally fail with this lock 
mode, presumably because we do not identify reads that are part of a 
read-modify-write (with {{SELECT ... FOR UPDATE}}).

{code}
  @Test
  public void testReadSeparateThread() throws Exception {
    saveTasks(TASK_A);
    Query.Builder query = Query.taskScoped(Tasks.id(TASK_A));
    ITaskEvent event = ITaskEvent.build(new TaskEvent()
        .setTimestamp(System.currentTimeMillis())
        .setMessage("message")
        .setStatus(ScheduleStatus.FINISHED));

    ExecutorService executor = Executors.newCachedThreadPool();
    executor.execute(new Runnable() {
      @Override
      public void run() {
        for (int i = 0; i < 1000; i++) {
          try {
            ScheduledTask builder = 
Iterables.getOnlyElement(fetchTasks(query)).newBuilder();
            switch (builder.getTaskEventsSize()) {
              case 1:
                builder.addToTaskEvents(event.newBuilder());
                saveTasks(IScheduledTask.build(builder));
                break;

              case 2:
                builder.getTaskEvents().remove(1);
                saveTasks(IScheduledTask.build(builder));
                break;

              default:
                System.out.println("WRITER - INVALID DATA: " + builder);
            }

            Thread.currentThread().sleep(1);
          } catch (InterruptedException e) {
            e.printStackTrace();
            Thread.currentThread().interrupt();
          }
        }
      }
    });
    Thread.currentThread().sleep(1000);

    executor.execute(new Runnable() {
      @Override
      public void run() {
        for (int i = 0; i < 1000; i++) {
          try {
            IScheduledTask task = Iterables.getOnlyElement(fetchTasks(query));
            if (task.getTaskEvents().size() == 1) {
              // No-op.
            } else if (task.getTaskEvents().size() == 2) {
              // No-op.
            } else {
              System.out.println("READER - INVALID DATA: " + task);
            }

            Thread.currentThread().sleep(1);
          } catch (InterruptedException e) {
            e.printStackTrace();
            Thread.currentThread().interrupt();
          }
        }
      }
    });

    executor.shutdown();
    if (!executor.awaitTermination(2, TimeUnit.MINUTES)) {
      executor.shutdownNow();
    }
  }
{code}

> java.util.NoSuchElementException from Tasks.getLatestEvent with DbTaskStore
> ---------------------------------------------------------------------------
>
>                 Key: AURORA-1580
>                 URL: https://issues.apache.org/jira/browse/AURORA-1580
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Zameer Manji
>
> I have discovered the following exception from a scheduler that is running 
> off master with the beta task store enabled.
> {noformat}
> E0113 22:51:55.941 [AsyncProcessor-2, AsyncUtil:123] 
> java.util.concurrent.ExecutionException: java.util.NoSuchElementException 
> java.util.concurrent.ExecutionException: java.util.NoSuchElementException
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122) 
> ~[na:1.8.0_66-Tw8r9b1]
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192) 
> ~[na:1.8.0_66-Tw8r9b1]
>         at 
> org.apache.aurora.scheduler.base.AsyncUtil.evaluateResult(AsyncUtil.java:118) 
> [aurora-110.jar:na]
>         at 
> org.apache.aurora.scheduler.base.AsyncUtil.access$000(AsyncUtil.java:32) 
> [aurora-110.jar:na]
>         at 
> org.apache.aurora.scheduler.base.AsyncUtil$1.afterExecute(AsyncUtil.java:59) 
> [aurora-110.jar:na]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150)
>  [na:1.8.0_66-Tw8r9b1]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_66-Tw8r9b1]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66-Tw8r9b1]
> Caused by: java.util.NoSuchElementException: null
>         at com.google.common.collect.Iterables.getLast(Iterables.java:784) 
> ~[guava-19.0.jar:na]
>         at 
> org.apache.aurora.scheduler.base.Tasks.getLatestEvent(Tasks.java:149) 
> ~[aurora-110.jar:na]
>         at org.apache.aurora.scheduler.base.Tasks$1.apply(Tasks.java:156) 
> ~[aurora-110.jar:na]
>         at org.apache.aurora.scheduler.base.Tasks$1.apply(Tasks.java:153) 
> ~[aurora-110.jar:na]
>         at 
> com.google.common.collect.ByFunctionOrdering.compare(ByFunctionOrdering.java:45)
>  ~[guava-19.0.jar:na]
>         at java.util.TimSort.binarySort(TimSort.java:296) 
> ~[na:1.8.0_66-Tw8r9b1]
>         at java.util.TimSort.sort(TimSort.java:239) ~[na:1.8.0_66-Tw8r9b1]
>         at java.util.Arrays.sort(Arrays.java:1438) ~[na:1.8.0_66-Tw8r9b1]
>         at com.google.common.collect.Ordering.sortedCopy(Ordering.java:860) 
> ~[guava-19.0.jar:na]
>         at 
> org.apache.aurora.scheduler.pruning.TaskHistoryPruner.lambda$registerInactiveTask$20(TaskHistoryPruner.java:156)
>  ~[aurora-110.jar:na]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_66-Tw8r9b1]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_66-Tw8r9b1]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  ~[na:1.8.0_66-Tw8r9b1]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  ~[na:1.8.0_66-Tw8r9b1]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_66-Tw8r9b1]
>         ... 2 common frames omitted
> {noformat}
> Similar exception occurs within the preemptor and causes the scheduler to 
> crash:
> {noformat}
> E0113 01:43:06.242 THREAD5037 
> com.google.common.util.concurrent.ServiceManager$ServiceListener.failed: 
> Service PreemptorService [FAILED] has failed in the RUNNING state.
> java.util.NoSuchElementException
>         at com.google.common.collect.Iterables.getLast(Iterables.java:784)
>         at 
> org.apache.aurora.scheduler.base.Tasks.getLatestEvent(Tasks.java:149)
>         at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor$3.apply(PendingTaskProcessor.java:237)
>         at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor$3.apply(PendingTaskProcessor.java:234)
>         at 
> com.google.common.base.Predicates$AndPredicate.apply(Predicates.java:374)
>         at 
> com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>         at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>         at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>         at 
> com.google.common.collect.TransformedIterator.hasNext(TransformedIterator.java:43)
>         at com.google.common.collect.Iterators.addAll(Iterators.java:364)
>         at com.google.common.collect.Iterables.addAll(Iterables.java:352)
>         at com.google.common.collect.HashMultiset.create(HashMultiset.java:66)
>         at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.fetchIdlePendingGroups(PendingTaskProcessor.java:183)
>         at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$140(PendingTaskProcessor.java:144)
>         at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:138)
>         at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
>         at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>         at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:570)
>         at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
>         at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:119)
>         at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:145)
>         at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>         at 
> com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Background:
> I discovered this while running a load test on a small test cluster. This 
> exception kept on occurring, preventing any tasks from being pruned from the 
> dbtask store. This exception also kept on killing the preemptor, causing a 
> new scheduler to be elected. Eventually the number of tasks stored for this 
> job reached 8k+ and there was a general slowdown observed across the entire 
> system. After a certain point, the scheduler wasn't able to register within 
> the 1minute registration timeout, causing the entire cluster to stop working 
> until I raised the timeout to 5 minutes, and then killed the job preventing 
> more tasks from being created.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to