-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165419
-----------------------------------------------------------

After testing this patch on our test clusters, I can see that this implementation generates a lot of garbage, which causes heap pressure. This happens because filtering takes place in memory: even when no tasks need to be pruned, the filtering itself still produces a lot of garbage. To avoid in-memory filtering, I am going to experiment with moving the pruning logic into `TaskStore`, similar to how it is currently done in `JobUpdateStore.pruneHistory()`.


- Mehrdad Nurolahzade


On Feb. 13, 2017, 9:30 a.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2017, 9:30 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
>     https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch addresses efficiency issues in the current implementation of
> `TaskHistoryPruner`. The new design is similar to that of
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run
> upon terminal task state transitions, it runs at preconfigured intervals,
> finds all terminal-state tasks that meet the pruning criteria, and deletes
> them. (b) It makes the initial task history pruning delay configurable so
> that pruning does not hamper the scheduler on startup.
> 
> The new design addresses the following two efficiency problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns
> about these dead tasks upon restart, when the log is replayed. These expired
> tasks are picked up by the second call to `executor.execute()`, which
> performs job-level pruning immediately (i.e., without delay). Hence, most
> task history pruning happens after scheduler restarts and can severely hamper
> scheduler performance (or cause consecutive fail-overs on test clusters when
> we load test the scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The
> asynchronous nature of `BatchWorker`, which is used to process task
> deletions, introduces some delay between delete enqueue and delete execution.
> As a result, tasks already queued for deletion in a previous evaluation round
> might get picked up, evaluated, and enqueued for deletion again. This is
> evident in the `tasks_pruned` metric, which reports numbers much higher than
> the actual number of expired tasks deleted.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java f77849498ff23616f1d56d133eb218f837ac3413 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> -------
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>
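
For illustration only, here is a minimal sketch of the interval-based approach with the pruning pushed into the store, so that no filtered copy of the task list is built in memory and each expired task is counted once. All names in it (`TaskRecord`, `InMemoryTaskStore`, `PeriodicTaskPruner`, `pruneExpired`) are hypothetical stand-ins and are not Aurora's actual classes or APIs; the real `TaskStore` and `BatchWorker` interactions are not modeled.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a stored task (not Aurora's task type).
final class TaskRecord {
  final String id;
  final boolean terminal;        // true once the task reached a terminal state
  final Instant lastTransition;  // time of the most recent state change

  TaskRecord(String id, boolean terminal, Instant lastTransition) {
    this.id = id;
    this.terminal = terminal;
    this.lastTransition = lastTransition;
  }
}

// Hypothetical store that prunes in place instead of handing a filtered
// list back to the caller. Not thread-safe on its own; the pruner below
// serializes access through a synchronized method.
final class InMemoryTaskStore {
  private final Map<String, TaskRecord> tasks = new HashMap<>();

  void save(TaskRecord task) {
    tasks.put(task.id, task);
  }

  int size() {
    return tasks.size();
  }

  // Removes terminal tasks whose last transition is older than the retention
  // window and returns the number of tasks actually deleted.
  int pruneExpired(Instant now, Duration retention) {
    Instant cutoff = now.minus(retention);
    int before = tasks.size();
    tasks.values().removeIf(t -> t.terminal && t.lastTransition.isBefore(cutoff));
    return before - tasks.size();
  }
}

// Hypothetical pruner that runs at a fixed interval after a configurable
// initial delay, rather than scheduling one delayed run per terminal task.
final class PeriodicTaskPruner {
  private final InMemoryTaskStore store;
  private final Duration retention;
  private long tasksPruned = 0;

  PeriodicTaskPruner(InMemoryTaskStore store, Duration retention) {
    this.store = store;
    this.retention = retention;
  }

  // One pruning round. Deletion is synchronous, so a task removed in this
  // round cannot be selected again in the next one.
  synchronized int pruneOnce(Instant now) {
    int deleted = store.pruneExpired(now, retention);
    tasksPruned += deleted;
    return deleted;
  }

  synchronized long getTasksPruned() {
    return tasksPruned;
  }

  // Starts the periodic run; the caller is responsible for shutting the
  // returned executor down.
  ScheduledExecutorService start(Duration initialDelay, Duration interval) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(Instant.now()),
        initialDelay.toMillis(),
        interval.toMillis(),
        TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Because the counter is incremented by the number of rows the store actually removed, a second round over the same state adds nothing to it, which is the property the `tasks_pruned` metric was missing in problem 2 of the description.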
