-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/62397/#review186521
-----------------------------------------------------------



FYI - I plan to open a discussion on this soon, but I intend to propose we take 
the scheduler storage in a different direction, which would make this patch 
unnecessary.  I need to write some prototype code to vet my idea, but I want to 
put this on the radar to prevent spending effort in a conflicting direction.

- Bill Farner


On Sept. 18, 2017, 10:32 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/62397/
> -----------------------------------------------------------
> 
> (Updated Sept. 18, 2017, 10:32 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch implements the option to turn on 'Replica Hot Standby', where 
> Scheduler replicas will be able to keep their volatile storage up to date 
> throughout normal operation. The main motivation behind this change is to 
> reduce failover time. If the leader fails over, the elected replica will be 
> able to serve production traffic much more quickly, as it has to rebuild less 
> state. This change also enables future work such as snapshots on replicas 
> and serving traffic from replicas.
> 
> There have been several discussions around this feature: 
> https://lists.apache.org/thread.html/e31d7dbcb054ed570f969ae2043eadfc090383edfe0751cec59b29d3@%3Cdev.aurora.apache.org%3E
> 
> Culminating in a design doc:
> https://docs.google.com/document/d/1DOtKA4-vrtxat1MaUYMQ6Y1iXhA8ob6Mfztzt-R1Oss/edit#heading=h.gjdgxs
> 
> The related Mesos patch can be found here:
> https://reviews.apache.org/r/62288/
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt ec3e934b2e0510b9339ac71182b78546cac0e7eb
>   src/main/java/org/apache/aurora/scheduler/log/Log.java dc77eb435e5f8fdce56727a2f679e8e1907e977c
>   src/main/java/org/apache/aurora/scheduler/log/mesos/LogInterface.java b0a7939131e1a3dceaf9635aec6746a5cd7ad394
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLog.java 21855e184fe20dc339713978b32344b6950701ec
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java 6704a328a4023a178ed8f86ae4772cb04eb2fa8e
>   src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java 2a5ec9c912979811c4badeee9362c22184d9cbbf
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 387350c7667a5fb8ee674ad0d3dd17529232b25b
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 835f1604c0c5d913a87d570ee01d053bbbf92ecb
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java ea147c0ba6aaa6d113144be0a8330bd2f73d2453
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java baf2647c54f1f9918139584e5151873a3853e83c
>   src/test/java/org/apache/aurora/scheduler/app/SchedulerIT.java a7c9c83eebbbea7ae8a6c807f98d3ce8bd050137
>   src/test/java/org/apache/aurora/scheduler/log/mesos/MesosLogTest.java f142f545799d64f9352b0ac6c51942eedf5e9ced
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 3f445595a81a5655c6c486791a9b55d8dc7f2f76
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 0eb54fdaddfbc2af76fd83ffee18ce4c6b61cc48
> 
> 
> Diff: https://reviews.apache.org/r/62397/diff/1/
> 
> 
> Testing
> -------
> 
> Added unit tests.
> 
> The current version of Mesos does not have the `catchup` function so the 
> tests will fail CI. However, they work on my box if I manually build Mesos 
> with the API change and add it to the dependencies.
> 
> I have had this patch running on a test cluster for the past couple of weeks. 
> There are a few small issues to work out around catchup failure, but it is 
> generally stable.
> 
> Initial (ad-hoc) observations show an improvement in 
> 'scheduler_storage_start' from 70-200 seconds to 5-80 seconds, depending on 
> whether the failover occurred immediately after a snapshot. I will compile more 
> comprehensive statistics around the results later (e.g. time from scheduler 
> disconnect to new scheduler being elected and serving traffic).
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>
