-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
-----------------------------------------------------------
(Updated Feb. 12, 2017, 2:49 p.m.)
Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and
Stephan Erb.
Changes
-------
Retrieval of terminated tasks is now limited to a job scope in order to
(1) decrease heap pressure, and
(2) eliminate the full scan of `MemTaskStore`.
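To illustrate why a job-scoped lookup helps, here is a minimal sketch that keeps terminated tasks indexed by job key, so that a query for one job touches only that job's history. The class `JobScopedLookupSketch`, its methods, and the string job keys are hypothetical and are not how `MemTaskStore` is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Aurora.
public class JobScopedLookupSketch {
  // Job key -> ids of that job's terminated tasks.
  private final Map<String, List<String>> terminatedByJob = new HashMap<>();

  public void addTerminated(String jobKey, String taskId) {
    terminatedByJob.computeIfAbsent(jobKey, k -> new ArrayList<>()).add(taskId);
  }

  // Returns only the given job's terminated tasks. The result size is bounded
  // by a single job's history rather than by the whole store.
  public List<String> terminatedTasks(String jobKey) {
    return Collections.unmodifiableList(
        terminatedByJob.getOrDefault(jobKey, Collections.emptyList()));
  }
}
```

With an index of this kind, each pruning step materializes one job's terminated tasks at a time instead of every terminated task in the store.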
Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837
Repository: aurora
Description
-------
This patch addresses efficiency issues in the current implementation of
`TaskHistoryPruner`. The new design is similar to that of
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon
terminal task state transitions, it runs at preconfigured intervals, finds all
terminal-state tasks that meet the pruning criteria, and deletes them. (b) It
makes the initial task history pruning delay configurable so that pruning does
not hamper the scheduler upon start.
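The interval-based approach can be sketched roughly as follows. This is a simplified model, not the code in the diff: the class `PeriodicPrunerSketch`, its in-memory map, and the parameters `retentionMs`, `initialDelayMs`, and `intervalMs` are assumptions made for illustration. Only `ScheduledExecutorService.scheduleWithFixedDelay` is a real JDK call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the Aurora code base.
public class PeriodicPrunerSketch {
  // Task id -> time (ms) at which the task reached a terminal state.
  private final Map<String, Long> terminatedAt = new ConcurrentHashMap<>();
  private final long retentionMs;

  public PeriodicPrunerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  public void recordTerminated(String taskId, long timestampMs) {
    terminatedAt.put(taskId, timestampMs);
  }

  // One pruning pass: delete every terminal task older than the retention
  // window and return the ids that were removed by this pass.
  public List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    for (Map.Entry<String, Long> entry : terminatedAt.entrySet()) {
      if (nowMs - entry.getValue() > retentionMs
          && terminatedAt.remove(entry.getKey(), entry.getValue())) {
        pruned.add(entry.getKey());
      }
    }
    return pruned;
  }

  // Runs pruneOnce repeatedly. The first run waits initialDelayMs, so a
  // freshly started process is not loaded with pruning work right away.
  public void start(ScheduledExecutorService executor, long initialDelayMs, long intervalMs) {
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs,
        intervalMs,
        TimeUnit.MILLISECONDS);
  }
}
```

Because each pass recomputes the expired set from stored state, nothing has to be rescheduled after a restart; the next pass simply picks up whatever is expired at that point.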
The new design addresses the following two efficiency problems:
1. Upon scheduler restart/failure, the in-memory state of task history pruning
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these
dead tasks upon restart, when the log is replayed. These expired tasks are
picked up by the second call to `executor.execute()`, which performs job-level
pruning immediately (i.e., without delay). Hence, most task history pruning
happens after scheduler restarts and can severely hamper scheduler performance
(or cause consecutive fail-overs on test clusters when we load test the
scheduler).
2. Expired tasks can be picked up for pruning multiple times. The asynchronous
nature of `BatchWorker`, which is used to process task deletions, introduces
some delay between delete enqueue and delete execution. As a result, tasks
already queued for deletion in a previous evaluation round might get picked up,
evaluated, and enqueued for deletion again. This is evident in the
`tasks_pruned` metric, which reports numbers much higher than the actual number
of expired tasks deleted.
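The second problem can be reproduced with a small model. The sketch below is hypothetical and does not use Aurora's `BatchWorker`: a plain queue stands in for the asynchronous worker, and a counter incremented at enqueue time stands in for `tasks_pruned`. The optional `dedupe` flag shows one possible guard (tracking ids that are already queued); it is an assumption for illustration, not necessarily what this patch does.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the double-counting problem; not Aurora code.
public class DoubleEnqueueSketch {
  private final Set<String> store = new LinkedHashSet<>();      // expired tasks still stored
  private final Queue<String> deleteQueue = new ArrayDeque<>(); // stand-in for the async worker queue
  private final Set<String> pending = new HashSet<>();          // ids already queued (used if dedupe)
  private final boolean dedupe;
  private long tasksPruned;   // counted at enqueue time
  private long tasksDeleted;  // counted when a task is actually removed

  public DoubleEnqueueSketch(boolean dedupe) {
    this.dedupe = dedupe;
  }

  public void addExpired(String taskId) {
    store.add(taskId);
  }

  // One evaluation round: enqueue every expired task still visible in the store.
  public void evaluate() {
    for (String id : store) {
      if (dedupe && !pending.add(id)) {
        continue; // already queued by an earlier round
      }
      deleteQueue.add(id);
      tasksPruned++;
    }
  }

  // The asynchronous worker eventually drains the queue.
  public void drain() {
    String id;
    while ((id = deleteQueue.poll()) != null) {
      if (store.remove(id)) {
        tasksDeleted++;
      }
      pending.remove(id);
    }
  }

  public long tasksPruned() {
    return tasksPruned;
  }

  public long tasksDeleted() {
    return tasksDeleted;
  }
}
```

In this model, two evaluation rounds that run before the queue is drained enqueue each task twice when `dedupe` is off, so the enqueue-time counter ends up at double the number of tasks actually deleted.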
Diffs (updated)
-----
src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java
735199ac1ccccab343c24471890aa330d6635c26
src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java
f77849498ff23616f1d56d133eb218f837ac3413
src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java
14e4040e0b94e96f77068b41454311fa3bf53573
Diff: https://reviews.apache.org/r/56575/diff/
Testing
-------
Manual testing under Vagrant
Thanks,
Mehrdad Nurolahzade