-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/62397/#review186521
-----------------------------------------------------------

FYI - I plan to open a discussion on this soon, but I intend to propose we take
the scheduler storage in a different direction, which would make this patch
unnecessary. I need to write some prototype code to vet my idea, but I want to
put this on the radar in case it prevents spending effort in a conflicting
direction.

- Bill Farner


On Sept. 18, 2017, 10:32 p.m., Jordan Ly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/62397/
> -----------------------------------------------------------
>
> (Updated Sept. 18, 2017, 10:32 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and
> Stephan Erb.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> This patch implements the option to turn on 'Replica Hot Standby', where
> Scheduler replicas will be able to keep their volatile storage up to date
> throughout normal operation. The main motivation behind this change is to
> reduce failover time. If the leader fails over, the elected replica will be
> able to serve production traffic much more quickly, as it has to rebuild less
> state. This change also enables future work such as snapshots on replicas
> and serving traffic from replicas.
>
> There have been several discussions around this feature:
> https://lists.apache.org/thread.html/e31d7dbcb054ed570f969ae2043eadfc090383edfe0751cec59b29d3@%3Cdev.aurora.apache.org%3E
>
> Culminating in a design doc:
> https://docs.google.com/document/d/1DOtKA4-vrtxat1MaUYMQ6Y1iXhA8ob6Mfztzt-R1Oss/edit#heading=h.gjdgxs
>
> The related Mesos patch can be found here:
> https://reviews.apache.org/r/62288/
>
>
> Diffs
> -----
>
>   config/legacy_untested_classes.txt ec3e934b2e0510b9339ac71182b78546cac0e7eb
>   src/main/java/org/apache/aurora/scheduler/log/Log.java dc77eb435e5f8fdce56727a2f679e8e1907e977c
>   src/main/java/org/apache/aurora/scheduler/log/mesos/LogInterface.java b0a7939131e1a3dceaf9635aec6746a5cd7ad394
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLog.java 21855e184fe20dc339713978b32344b6950701ec
>   src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java 6704a328a4023a178ed8f86ae4772cb04eb2fa8e
>   src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java 2a5ec9c912979811c4badeee9362c22184d9cbbf
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 387350c7667a5fb8ee674ad0d3dd17529232b25b
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 835f1604c0c5d913a87d570ee01d053bbbf92ecb
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java ea147c0ba6aaa6d113144be0a8330bd2f73d2453
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java baf2647c54f1f9918139584e5151873a3853e83c
>   src/test/java/org/apache/aurora/scheduler/app/SchedulerIT.java a7c9c83eebbbea7ae8a6c807f98d3ce8bd050137
>   src/test/java/org/apache/aurora/scheduler/log/mesos/MesosLogTest.java f142f545799d64f9352b0ac6c51942eedf5e9ced
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 3f445595a81a5655c6c486791a9b55d8dc7f2f76
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 0eb54fdaddfbc2af76fd83ffee18ce4c6b61cc48
>
>
> Diff: https://reviews.apache.org/r/62397/diff/1/
>
>
> Testing
> -------
>
> Added unit tests.
>
> The current version of Mesos does not have the `catchup` function, so the
> tests will fail CI. However, they work on my box if I manually build Mesos
> with the API change and add it to the dependencies.
>
> I have had this patch running on a test cluster for the past couple of weeks.
> There are a few small issues to work out around catchup failure, but it is
> generally stable.
>
> Initial (ad-hoc) observations show an improvement in
> 'scheduler_storage_start' from 70-200 seconds to 5-80 seconds, depending on
> whether the failover occurred immediately after a snapshot. I will compile
> more comprehensive statistics around the results later (e.g., time from
> scheduler disconnect to new scheduler being elected and serving traffic).
>
>
> Thanks,
>
> Jordan Ly
>
>
