(Apologies for the wordy problem statement, but I feel it's really necessary to justify the proposal.)
Over the past two weeks we have been battling a nasty scheduler issue in production: the scheduler suddenly stops responding to any user requests and subsequently gets killed by our health monitoring. Upon restart, a leader may function for only a few seconds before hanging again.

The long and painful investigation pointed to internal H2 table lock contention that resulted in massive db-write starvation and a state where a scheduler write lock would *never* be released. This was relatively easy to replicate in Vagrant by creating a large update (~4K instances) with a large batch_size (~1K) while bombarding the scheduler with getJobUpdateDetails() requests for that job. The scheduler would lock up on the very first write op following the update creation (e.g. a status update for an instance transition from the first batch) and stay in that state for minutes, until all getJobUpdateDetails() requests were served.

This behavior is well explained by the following sentence from [1]:

"When a lock is released, and multiple connections are waiting for it, one of them is picked at random."

When many read requests compete for a shared table lock, the H2 PageStore does nothing to help a write request that needs an exclusive table lock get through. This leads to db-write starvation and, eventually, to starvation of scheduler native store writes, as there is no timeout on a scheduler write lock.

We played with the various available H2/MyBatis configuration settings to mitigate the above, with no noticeable impact. That is, until we switched to H2 MVStore [2], at which point we were able to completely eliminate the scheduler lockup without making any other code changes! So, has the solution finally been found? The answer would be YES, until you try MVStore-enabled H2 with any reasonably sized production DB on scheduler restart.
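As an aside, the starvation effect of the random-pick policy quoted from [1] can be illustrated with a toy simulation. This is a simplified model under stated assumptions, not H2's actual locking code: one writer competes with a steady pool of read requests, every lock release picks one waiter uniformly at random, and each served reader is immediately replaced by a new one. The function names are hypothetical.

```python
import random


def handoffs_until_writer_runs(pending_reads, rng):
    """Count lock hand-offs until the single writer is picked.

    Toy model: the wait queue always holds one writer plus
    `pending_reads` readers (sustained read load), and every
    release picks one waiter uniformly at random.
    """
    handoffs = 0
    while True:
        handoffs += 1
        # Slot 0 stands for the writer; slots 1..pending_reads are readers.
        if rng.randrange(pending_reads + 1) == 0:
            return handoffs


def mean_writer_wait(pending_reads, trials=2000, seed=42):
    """Average number of hand-offs the writer waits over many trials."""
    rng = random.Random(seed)
    total = sum(handoffs_until_writer_runs(pending_reads, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for readers in (1, 10, 100):
        print(readers, "competing readers ->",
              round(mean_writer_wait(readers), 1), "hand-offs on average")
```

In this model the writer's expected wait is N + 1 hand-offs for N competing readers, so it grows linearly with read load and has no upper bound for any single write. A policy that favors waiting writers would bound that wait, but the random pick does not.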
There was a reason why we disabled MVStore in the scheduler [3] in the first place, and that reason was poor MVStore performance with bulk inserts. Re-populating an MVStore-enabled H2 DB took at least 2.5 times longer than normal. This is unacceptable in prod, where every second of scheduler downtime counts.

Back at the drawing board, we tried all relevant settings and approaches to speed up MVStore inserts on restart, but nothing really helped. Finally, the only reasonable way forward was to eliminate the point of slowness altogether, namely to remove the thrift-to-sql migration on restart. Fortunately, H2 supports an easy-to-use command that generates the entire DB dump with a single statement [4]. We were now able to bypass the lengthy DB repopulation on restart by storing the entire DB dump in the snapshot and replaying it on scheduler restart.

Now, the proposal. Given that MVStore vastly outperforms the PageStore we currently use, I suggest we move our H2 to it AND adopt db snapshotting instead of thrift snapshotting to speed up scheduler restarts. A rough POC is available here [5]. We have been running a version of this build in production since last week and were able to completely eliminate scheduler lockups. As a welcome side effect, we also observed faster scheduler restart times due to eliminating thrift-to-sql chattiness. Depending on snapshot freshness, the observed failover downtimes were reduced by ~40%.

Moving to db snapshotting will require us to rethink DB schema versioning and our thrift deprecation/removal policy. We will have to move to pre-/post-snapshot-restore SQL migration scripts to handle any schema changes, which is a common industry pattern but something we have not tried yet. The upside, though, is that we can get an early start here, as we will have to adopt strict SQL migration rules anyway when we move to persistent DB storage.
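To make the dump-and-replay idea with a post-restore migration concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for H2, with `iterdump()` playing the role of the H2 dump statement from [4]. It is not the POC code from [5]; the table, column and function names are hypothetical.

```python
import sqlite3


def build_example_db():
    """Create a small in-memory DB standing in for the scheduler's SQL store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_updates (id INTEGER PRIMARY KEY, job TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO job_updates (job) VALUES (?)",
        [("web",), ("db",), ("cache",)])
    conn.commit()
    return conn


def dump_db(conn):
    """Serialize schema and data as one SQL script (the snapshot payload)."""
    return "\n".join(conn.iterdump())


def restore_db(script, post_restore_migrations=()):
    """Replay a dump into a fresh DB, then apply schema migrations.

    The migrations run after the restore, so a dump taken with an
    older schema can still be loaded by a newer build.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    for statement in post_restore_migrations:
        conn.execute(statement)
    conn.commit()
    return conn


if __name__ == "__main__":
    snapshot = dump_db(build_example_db())
    restored = restore_db(
        snapshot,
        ["ALTER TABLE job_updates ADD COLUMN status TEXT DEFAULT 'UNKNOWN'"])
    for row in restored.execute("SELECT id, job, status FROM job_updates"):
        print(row)
```

The point of the sketch is the ordering: the snapshot is replayed as-is, and schema changes are applied afterwards as explicit SQL statements rather than being handled implicitly during a thrift-to-sql conversion.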
Also, given that migrating to H2 TaskStore will likely further degrade scheduler restart times, having a better performing DB snapshotting solution in place will definitely aid migration.

Thanks,
Maxim

[1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
[2] - http://www.h2database.com/html/mvstore.html
[3] - https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
[4] - http://www.h2database.com/html/grammar.html#script
[5] - https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370