-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59030/
-----------------------------------------------------------
Review request for Aurora, David McLaughlin, Stephan Erb, and Zameer Manji.
Bugs: AURORA-1869
https://issues.apache.org/jira/browse/AURORA-1869
Repository: aurora
Description
-------
`TaskStatusHandlerImpl` acquires the `LogStorage` write lock to process every
status update received from the Mesos master. During implicit and explicit
reconciliations, the lock is therefore acquired once per task in the cluster
(tens of thousands of times in ours).
According to data extracted from one of our production clusters, over 99.9% of
reconciliation status update events are in fact `NOOP` status updates. The
storage write lock contention induced by these status updates can be
eliminated by adopting the double-checked locking pattern (as was done in
[AURORA-1820](https://issues.apache.org/jira/browse/AURORA-1820)).
This contention explains why combining reconciliation status update processing
with other expensive operations, such as a snapshot, can be fatal for the
scheduler. Because the lock is not fair, it does not guarantee any particular
access order, so snapshot structures may need to sit on the heap for a few
seconds before they can be written to `LogStorage` and garbage collected.
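For reviewers unfamiliar with the pattern, here is a minimal sketch of double-checked locking applied to status updates. It is not the code in this diff: `NoopStatusUpdateSketch`, `seed`, `handleStatusUpdate`, and the string-valued states are hypothetical stand-ins, and a plain `ReentrantReadWriteLock` is assumed in place of the real storage layer. The idea is to check for a `NOOP` under the read lock first, and to take the write lock (re-checking once inside it) only when the update would actually change state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the scheduler's task store; not Aurora's actual API.
class NoopStatusUpdateSketch {
  // The default constructor creates a non-fair lock, matching the
  // "lock is not fair" behavior described above.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> taskStates = new HashMap<>();
  // Counts how many status updates had to take the write lock.
  private final AtomicInteger writeLockedUpdates = new AtomicInteger();

  // Test helper: records an initial state for a task.
  void seed(String taskId, String state) {
    lock.writeLock().lock();
    try {
      taskStates.put(taskId, state);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Caller must hold the read or write lock.
  private boolean isNoop(String taskId, String reportedState) {
    return reportedState.equals(taskStates.get(taskId));
  }

  // Returns true if the update changed the stored state, false for a NOOP.
  boolean handleStatusUpdate(String taskId, String reportedState) {
    // First check: cheap, under the shared read lock.
    lock.readLock().lock();
    try {
      if (isNoop(taskId, reportedState)) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
    }

    // Second check: the state may have changed between releasing the
    // read lock and acquiring the write lock, so verify again.
    lock.writeLock().lock();
    try {
      writeLockedUpdates.incrementAndGet();
      if (isNoop(taskId, reportedState)) {
        return false;
      }
      taskStates.put(taskId, reportedState);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int writeLockedUpdates() {
    return writeLockedUpdates.get();
  }

  public static void main(String[] args) {
    NoopStatusUpdateSketch sketch = new NoopStatusUpdateSketch();
    sketch.seed("task-1", "RUNNING");
    System.out.println(sketch.handleStatusUpdate("task-1", "RUNNING"));
    System.out.println(sketch.handleStatusUpdate("task-1", "FINISHED"));
    System.out.println(sketch.writeLockedUpdates());
  }
}
```

In this sketch a reconciliation update that matches the stored state never touches the write lock, so, if the 99.9% figure above holds, nearly all reconciliation traffic would only contend on the shared read lock.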
Diffs
-----
src/main/java/org/apache/aurora/scheduler/TaskStatusHandlerImpl.java
1aacecf3c2597a3f91dbc7da4c99fd1e80970f04
src/test/java/org/apache/aurora/scheduler/TaskStatusHandlerImplTest.java
56a6b0c9ae8da18e9a47428b8ed37a559cfd04e7
src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java
21d26b3930ea965487b2dec48a48a98677ba022b
Diff: https://reviews.apache.org/r/59030/diff/1/
Testing
-------
TBD: to be verified on a test cluster.
Thanks,
Mehrdad Nurolahzade