-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59030/#review174175
-----------------------------------------------------------



While the change looks innocent, I am wondering if it invalidates parts of the 
previous design of processing everything in batches. In addition, it could 
lead to significantly reduced worst-case performance after a network 
partition, as we would then have multiple lock operations per batch instead of 
a single one. No idea how likely this is, though...
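
To make the concern concrete, here is a minimal sketch (hypothetical names, 
not the actual Aurora code) contrasting one write transaction per batch with 
per-update double-checked locking:

```java
// Previous behavior: a single write transaction covers the whole batch,
// so a batch of N updates costs one write lock acquisition.
storage.write(store -> {
  for (TaskStatus status : batch) {
    applyStateChange(store, status);
  }
  return null;
});

// With per-update double-checked locking, each non-NOOP update takes its
// own write transaction. After a network partition, where most updates
// are real state changes, a batch of N updates may cost N write lock
// acquisitions instead of one.
for (TaskStatus status : batch) {
  processStatusUpdate(status);  // see the sketch in the quoted review below
}
```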

The best thing is probably to wait for your testing in the scale environment; 
then we can see how we want to proceed.

- Stephan Erb


On May 5, 2017, 11:36 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59030/
> -----------------------------------------------------------
> 
> (Updated May 5, 2017, 11:36 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1869
>     https://issues.apache.org/jira/browse/AURORA-1869
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> `TaskStatusHandlerImpl` acquires the `LogStorage` write lock to process every 
> status update received from the Mesos master. During implicit and explicit 
> reconciliations, the number of write lock acquisitions equals the number of 
> tasks in the cluster (tens of thousands in our cluster).
> 
> According to data extracted from one of our production clusters, over 99.9% 
> of reconciliation status update events are in fact `NOOP` status updates. The 
> storage write lock contention induced by these updates can be eliminated by 
> adopting the double-checked locking pattern (as was done in 
> [AURORA-1820](https://issues.apache.org/jira/browse/AURORA-1820)).
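> 
> To illustrate the pattern, here is a minimal sketch (hypothetical helper 
> names `wouldChangeState` and `applyStateChange`, assuming a 
> `storage.read`/`storage.write` style API; not the actual 
> `TaskStatusHandlerImpl` code): check under the cheaper read lock first, and 
> take the write lock only when the update would actually change state.
> 
> ```java
> // Double-checked locking for status updates (hypothetical sketch).
> private void processStatusUpdate(TaskStatus status) {
>   // First check under the read lock only. Since over 99.9% of
>   // reconciliation updates are NOOPs, most calls return here without
>   // ever touching the write lock.
>   if (!storage.read(store -> wouldChangeState(store, status))) {
>     return;
>   }
>   // Second check under the write lock, because the state may have
>   // changed between releasing the read lock and acquiring this one.
>   storage.write(store -> {
>     if (wouldChangeState(store, status)) {
>       applyStateChange(store, status);
>     }
>     return null;
>   });
> }
> ```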
> 
> This explains why the combination of reconciliation status update processing 
> and other expensive operations, such as snapshots, can be fatal for the 
> scheduler. Since the lock is not fair, it does not guarantee any particular 
> access order; snapshot structures may therefore sit on the heap for several 
> seconds before they can be written to `LogStorage` and garbage collected.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/TaskStatusHandlerImpl.java 
> 1aacecf3c2597a3f91dbc7da4c99fd1e80970f04 
>   src/test/java/org/apache/aurora/scheduler/TaskStatusHandlerImplTest.java 
> 56a6b0c9ae8da18e9a47428b8ed37a559cfd04e7 
>   
> src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java
>  21d26b3930ea965487b2dec48a48a98677ba022b 
> 
> 
> Diff: https://reviews.apache.org/r/59030/diff/1/
> 
> 
> Testing
> -------
> 
> TBD under a test cluster
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>
