-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/62397/#review186461
-----------------------------------------------------------



I won't have time for a proper review before next week.

However, one question is already bugging me: what would our upgrade path to
the required Mesos version be? Can we upgrade directly to Mesos 1.x (assuming
1.x contains the extended replicated log)? Or do we have to step through 1.3,
1.4, ... up to 1.x to get the support? I believe that with the release of 1.0,
Mesos promised that it should be possible to skip intermediate versions, but
we should verify this to be sure.

- Stephan Erb


On Sept. 19, 2017, 7:32 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/62397/
> -----------------------------------------------------------
> 
> (Updated Sept. 19, 2017, 7:32 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch implements the option to turn on 'Replica Hot Standby', where 
> Scheduler replicas will be able to keep their volatile storage up to date 
> throughout normal operation. The main motivation behind this change is to 
> reduce failover time. If the leader fails over, the elected replica will be
> able to serve production traffic much sooner, as it has less state to
> rebuild. This change also enables future work such as snapshots on replicas
> and serving traffic from replicas.
> 
> There have been several discussions around this feature: 
> https://lists.apache.org/thread.html/e31d7dbcb054ed570f969ae2043eadfc090383edfe0751cec59b29d3@%3Cdev.aurora.apache.org%3E
> 
> Culminating in a design doc:
> https://docs.google.com/document/d/1DOtKA4-vrtxat1MaUYMQ6Y1iXhA8ob6Mfztzt-R1Oss/edit#heading=h.gjdgxs
> 
> The related Mesos patch can be found here:
> https://reviews.apache.org/r/62288/
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt ec3e934b2e0510b9339ac71182b78546cac0e7eb
>   src/main/java/org/apache/aurora/scheduler/log/Log.java dc77eb435e5f8fdce56727a2f679e8e1907e977c
>   src/main/java/org/apache/aurora/scheduler/log/mesos/LogInterface.java b0a7939131e1a3dceaf9635aec6746a5cd7ad394
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLog.java 21855e184fe20dc339713978b32344b6950701ec
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java 6704a328a4023a178ed8f86ae4772cb04eb2fa8e
>   src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java 2a5ec9c912979811c4badeee9362c22184d9cbbf
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 387350c7667a5fb8ee674ad0d3dd17529232b25b
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 835f1604c0c5d913a87d570ee01d053bbbf92ecb
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java ea147c0ba6aaa6d113144be0a8330bd2f73d2453
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java baf2647c54f1f9918139584e5151873a3853e83c
>   src/test/java/org/apache/aurora/scheduler/app/SchedulerIT.java a7c9c83eebbbea7ae8a6c807f98d3ce8bd050137
>   src/test/java/org/apache/aurora/scheduler/log/mesos/MesosLogTest.java f142f545799d64f9352b0ac6c51942eedf5e9ced
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 3f445595a81a5655c6c486791a9b55d8dc7f2f76
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 0eb54fdaddfbc2af76fd83ffee18ce4c6b61cc48
> 
> 
> Diff: https://reviews.apache.org/r/62397/diff/1/
> 
> 
> Testing
> -------
> 
> Added unit tests.
> 
> The current version of Mesos does not have the `catchup` function, so the
> tests will fail in CI. However, they pass on my machine if I manually build
> Mesos with the API change and add it to the dependencies.
> 
> I have had this patch running on a test cluster for the past couple of weeks. 
> There are a few small issues to work out around catchup failure, but it is 
> generally stable.
> 
> Initial (ad-hoc) observations show an improvement in
> 'scheduler_storage_start' from 70-200 seconds to 5-80 seconds, depending on
> whether the failover occurred immediately after a snapshot. I will compile
> more comprehensive statistics on the results later (e.g., the time from
> scheduler disconnect to a new scheduler being elected and serving traffic).
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>
