Re: Review Request 56575: AURORA-1837 Improve task history pruning

David McLaughlin Wed, 15 Feb 2017 10:19:52 -0800


> On Feb. 15, 2017, 5:40 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 150-151
> > <https://reviews.apache.org/r/56575/diff/5/?file=1634553#file1634553line150>
> >
> >     Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?
> 
> Mehrdad Nurolahzade wrote:
>     That's how I originally did it (see revision 1). The concern expressed by 
> reviewers (read above) was around the following areas:
>     1. We are trying to load/filter all terminal state tasks at once which 
> might cause heap pressure. Looking at one job at a time might be slow and 
> inefficient from a throughput standpoint (which is a secondary concern) but 
> can help with putting less pressure on the heap.
>     2. `MemStore` does not have a scheduling status index therefore it has to 
> sequentially scan all tasks to find matching tasks.


Were those concerns backed by data? Do we have performance numbers from before 
and after that change? It doesn't seem to me like a given that moving from a 
single query to N+1 queries leads to less work and less objects being created. 
Especially when the tasks are already on heap (DbTaskStore is still considered 
experimental and not production ready. We should not optimize for it). 

For point (2) - this does not prevent us scanning every single task in the 
store. It just divides the full scan into multiple queries, and each of those 
queries (and subsequent filtering) has the overhead of objects, streams, sorted 
copies (yikes) and collections that are used for filtering inside the task 
processing loop.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
-----------------------------------------------------------


On Feb. 15, 2017, 5:07 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> -----------------------------------------------------------
> 
> (Updated Feb. 15, 2017, 5:07 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
>     https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> -------
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>

Re: Review Request 56575: AURORA-1837 Improve task history pruning

Reply via email to