Thanks for the detailed write-up and the real-world context! I generally support momentum towards a single task store implementation, so +1 on dealing with that.
I anticipated there would be a performance win from straight-to-SQL snapshots, so I am a +1 on that as well. In summary, +1 on all fronts! I have also appended a few rough sketches below the quoted message to illustrate the moving parts.

On Monday, February 29, 2016, Maxim Khutornenko <ma...@apache.org> wrote:

> (Apologies for the wordy problem statement, but I feel it's really
> necessary to justify the proposal.)
>
> Over the past two weeks we have been battling a nasty scheduler issue
> in production: the scheduler suddenly stops responding to any user
> requests and subsequently gets killed by our health monitoring. Upon
> restart, a leader may only function for a few seconds and almost
> immediately hangs again.
>
> The long and painful investigation pointed towards internal H2 table
> lock contention that resulted in massive db-write starvation and a
> state where a scheduler write lock would *never* be released. This
> was relatively easy to replicate in Vagrant by creating a large
> update (~4K instances) with a large batch_size (~1K) while bombarding
> the scheduler with getJobUpdateDetails() requests for that job. The
> scheduler would enter a locked-up state on the very first write op
> following the update creation (e.g. a status update for an instance
> transition from the first batch) and stay in that state for minutes,
> until all getJobUpdateDetails() requests were served. This behavior
> is well explained by the following sentence from [1]:
>
> "When a lock is released, and multiple connections are waiting for
> it, one of them is picked at random."
>
> What happens here is that in a situation where many more read
> requests are competing for a shared table lock, the H2 PageStore does
> not help write requests requiring an exclusive table lock to succeed
> in any way. This leads to db-write starvation and eventual scheduler
> native store write starvation, as there is no timeout on a scheduler
> write lock.
>
> We have played with various available H2/MyBatis configuration
> settings to mitigate the above, with no noticeable impact. That is,
> until we switched to the H2 MVStore [2], at which point we were able
> to completely eliminate the scheduler lockup without making any other
> code changes! So, has the solution finally been found? The answer
> would be YES, until you try MVStore-enabled H2 with any reasonably
> sized production DB on scheduler restart. There was a reason why we
> disabled MVStore in the scheduler [3] in the first place, and that
> reason was poor MVStore performance with bulk inserts. Re-populating
> an MVStore-enabled H2 DB took at least 2.5 times longer than normal.
> This is unacceptable in prod, where every second of scheduler
> downtime counts.
>
> Back to the drawing board, we tried all relevant settings and
> approaches to speed up MVStore inserts on restart, but nothing really
> helped. Finally, the only reasonable way forward was to eliminate the
> point of slowness altogether, namely to remove the thrift-to-SQL
> migration on restart. Fortunately, H2 supports an easy-to-operate
> command to generate the entire DB dump with a single statement [4].
> We were now able to bypass the lengthy DB repopulation on restart by
> storing the entire DB dump in the snapshot and replaying it on
> scheduler restart.
>
> Now, the proposal. Given that MVStore vastly outperforms the
> PageStore we currently use, I suggest we move our H2 to it AND adopt
> db snapshotting instead of thrift snapshotting to speed up scheduler
> restarts. The rough POC is available here [5].
> We have been running a version of this build in production since last
> week and have completely eliminated scheduler lockups. As a welcome
> side effect, we also observed faster scheduler restart times due to
> eliminating the thrift-to-SQL chattiness. Depending on snapshot
> freshness, the observed failover downtimes were reduced by ~40%.
>
> Moving to db snapshotting will require us to rethink DB schema
> versioning and the thrift deprecation/removal policy. We will have to
> move to pre-/post-snapshot-restore SQL migration scripts to handle
> any schema changes, which is a common industry pattern but something
> we have not tried yet. The upside, though, is that we can get an
> early start here, as we will have to adopt strict SQL migration rules
> anyway when we move to persistent DB storage. Also, given that
> migrating to an H2 TaskStore will likely further degrade scheduler
> restart times, having a better-performing DB snapshotting solution in
> place will definitely aid that migration.
>
> Thanks,
> Maxim
>
> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
> [2] - http://www.h2database.com/html/mvstore.html
> [3] - https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
> [4] - http://www.h2database.com/html/grammar.html#script
> [5] - https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370
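As promised, a few sketches. First, the read/write starvation mode Maxim describes, reduced to a toy JDBC harness. The table name and thread counts are invented, and whether starvation actually manifests will depend on timing, lock mode, and H2 version; this is illustrative only, not the Vagrant repro:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Toy model of the failure mode: many readers repeatedly taking the
    // shared table lock while one writer waits for the exclusive lock.
    public final class LockContentionToy {
      private static final String URL =
          "jdbc:h2:mem:toy;MV_STORE=FALSE;DB_CLOSE_DELAY=-1";

      public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          s.execute("CREATE TABLE job_updates (id INT, payload VARCHAR)");
        }

        // The read storm, standing in for getJobUpdateDetails() traffic.
        ExecutorService readers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
          readers.submit(LockContentionToy::readLoop);
        }

        // The single write op that would starve under heavy enough reads.
        long start = System.nanoTime();
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          s.execute("INSERT INTO job_updates VALUES (1, 'x')");
        }
        System.out.printf("write waited %d ms%n",
            (System.nanoTime() - start) / 1_000_000);

        readers.shutdownNow();
        readers.awaitTermination(5, TimeUnit.SECONDS);
      }

      private static void readLoop() {
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          while (!Thread.currentThread().isInterrupted()) {
            try (ResultSet rs =
                s.executeQuery("SELECT COUNT(*) FROM job_updates")) {
              rs.next();
            }
          }
        } catch (SQLException e) {
          // Expected when the pool is shut down mid-query.
        }
      }
    }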
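Second, the engine switch itself. In H2 1.4.x the storage engine is selected by a connection-URL flag, so at the JDBC level (ignoring Aurora's actual DbModule wiring [3], and with made-up names) the PageStore-to-MVStore move is roughly:

    import java.sql.Connection;
    import java.sql.DriverManager;

    // Hypothetical helper; the real setting lives in DbModule [3].
    final class H2Engine {
      // Current scheduler setup: MVStore explicitly disabled (PageStore).
      static final String PAGE_STORE_URL =
          "jdbc:h2:mem:aurora;MV_STORE=FALSE;DB_CLOSE_DELAY=-1";

      // Proposed setup: MVStore-backed database.
      static final String MV_STORE_URL =
          "jdbc:h2:mem:aurora;MV_STORE=TRUE;DB_CLOSE_DELAY=-1";

      static Connection open(String url) throws Exception {
        return DriverManager.getConnection(url);
      }
    }

As I understand it, the MVStore-backed engine gives multi-version concurrency rather than table-level locking, which would explain why the reader storm stops blocking the writer.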
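Third, the dump/replay pair behind the db-snapshotting idea. [4] documents the SCRIPT command; the restore side is RUNSCRIPT. A file-based sketch follows (the paths are placeholders; note the POC [5] instead captures the output of a plain SCRIPT statement, which returns the SQL as a result set, and stores it in the snapshot rather than on disk):

    import java.sql.Connection;
    import java.sql.Statement;

    // Sketch of the snapshot fast path: dump the whole DB as one SQL
    // script and replay it on failover, instead of re-inserting every
    // row from deserialized thrift structs.
    final class DbDump {
      static void dump(Connection db, String path) throws Exception {
        try (Statement s = db.createStatement()) {
          s.execute("SCRIPT TO '" + path + "'");  // schema + data [4]
        }
      }

      static void restore(Connection freshDb, String path) throws Exception {
        try (Statement s = freshDb.createStatement()) {
          s.execute("RUNSCRIPT FROM '" + path + "'");
        }
      }
    }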
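Finally, on the pre-/post-restore migration scripts: nothing exists in Aurora for this yet, so everything below is hypothetical, but the shape could be as small as a pair of ordered statement lists chosen per the snapshot's schema version:

    import java.sql.Connection;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical pre-/post-restore hooks: run version-specific schema
    // fix-ups around replaying a dump written by an older scheduler build.
    final class SnapshotMigrations {
      // Applied before RUNSCRIPT, e.g. so an older dump still parses.
      static final List<String> PRE_RESTORE = Arrays.asList(
          // "ALTER TABLE ..." statements, selected per snapshot version
      );

      // Applied after RUNSCRIPT, e.g. backfills or column drops.
      static final List<String> POST_RESTORE = Arrays.asList();

      static void apply(Connection db, List<String> statements)
          throws Exception {
        try (Statement s = db.createStatement()) {
          for (String sql : statements) {
            s.execute(sql);
          }
        }
      }
    }

Keeping the migrations as plain SQL from day one also lines up nicely with the eventual move to persistent DB storage that Maxim mentions.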