
Review request for Aurora, David McLaughlin, Stephan Erb, and Zameer Manji.


Bugs: AURORA-1869
    https://issues.apache.org/jira/browse/AURORA-1869


Repository: aurora


Description
-------

`TaskStatusHandlerImpl` acquires the `LogStorage` write lock to process every 
status update received from the Mesos master. During implicit and explicit 
reconciliation, the lock is therefore acquired once per task in the cluster 
(tens of thousands of times in our cluster).

According to data extracted from one of our production clusters, over 99.9% of 
reconciliation status update events are in fact `NOOP` status updates. The 
storage write lock contention induced by these status updates can be eliminated 
by adopting the double-checked locking pattern, i.e. checking whether an update 
is a `NOOP` before taking the write lock and checking again once the lock is 
held (as was done in 
[AURORA-1820](https://issues.apache.org/jira/browse/AURORA-1820)).
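
As a rough illustration of the pattern named above, here is a minimal, 
self-contained sketch. It is not the code in this diff: the class 
`DoubleCheckedStatusHandler`, its in-memory map, and the 
`writeLockAcquisitions` counter are hypothetical stand-ins for 
`TaskStatusHandlerImpl` and `LogStorage`, and it assumes the first check can be 
done under a shared read lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a status handler; not Aurora's actual classes. */
class DoubleCheckedStatusHandler {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // Guarded by `lock`: read under the read lock, mutated under the write lock.
  private final Map<String, String> taskStates = new HashMap<>();
  // Counts write lock acquisitions so the effect of the first check is visible.
  private final AtomicInteger writeLockAcquisitions = new AtomicInteger();

  /** Returns true if the update changed the stored state, false for a NOOP. */
  boolean handle(String taskId, String newState) {
    // First check: shared read lock only, so NOOP updates never contend
    // for the exclusive write lock.
    lock.readLock().lock();
    try {
      if (newState.equals(taskStates.get(taskId))) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
    }

    // Second check: the state may have changed between releasing the read
    // lock and acquiring the write lock, so it is re-checked before writing.
    lock.writeLock().lock();
    try {
      writeLockAcquisitions.incrementAndGet();
      if (newState.equals(taskStates.get(taskId))) {
        return false;
      }
      taskStates.put(taskId, newState);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int writeLockAcquisitions() {
    return writeLockAcquisitions.get();
  }
}
```

In this sketch, repeated calls such as `handle("task-1", "RUNNING")` take the 
write lock only for the first call; later identical updates return `false` 
after the read-side check alone.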

This contention explains why the combination of reconciliation status update 
processing and other expensive processes, such as taking a snapshot, can be 
fatal for the scheduler. Because the lock is not fair, it does not guarantee 
any particular access order. Snapshot structures may therefore have to sit on 
the heap for a few seconds before they can be written to `LogStorage` and 
garbage collected.
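
For readers unfamiliar with lock fairness, the following sketch shows the 
distinction using `java.util.concurrent.locks.ReentrantReadWriteLock`. 
Treating the storage lock as this class is an assumption made for 
illustration, and `LockFairnessSketch` is a hypothetical name. The sketch is 
not a proposal to switch modes, since fair locks generally trade throughput 
for ordering.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of fair vs. non-fair lock construction. */
class LockFairnessSketch {
  // Default constructor: non-fair mode. Waiting threads may be overtaken,
  // so a thread holding large structures can wait behind many later arrivals.
  static final ReentrantReadWriteLock NON_FAIR = new ReentrantReadWriteLock();

  // Fair mode: under contention, threads acquire the lock in approximately
  // arrival order.
  static final ReentrantReadWriteLock FAIR = new ReentrantReadWriteLock(true);
}
```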


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/TaskStatusHandlerImpl.java 1aacecf3c2597a3f91dbc7da4c99fd1e80970f04 
  src/test/java/org/apache/aurora/scheduler/TaskStatusHandlerImplTest.java 56a6b0c9ae8da18e9a47428b8ed37a559cfd04e7 
  src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java 21d26b3930ea965487b2dec48a48a98677ba022b 


Diff: https://reviews.apache.org/r/59030/diff/1/


Testing
-------

TBD on a test cluster


Thanks,

Mehrdad Nurolahzade
