Seems prudent to explore rather than write it off, though. For all we know it simplifies a lot.
On Wednesday, March 2, 2016, Maxim Khutornenko <ma...@apache.org> wrote:

> Ah, sorry, missed that conversation on IRC.
>
> I have not looked into that. It would be interesting to explore that
> route. Given that our ultimate goal is to get rid of the replicated log
> altogether, it does not stand as an immediate priority to me, though.
>
> On Wed, Mar 2, 2016 at 11:51 AM, Erb, Stephan
> <stephan....@blue-yonder.com> wrote:
> > +1 for the plan and the ticket.
> >
> > In addition, for reference, a couple of messages from IRC yesterday:
> >
> > 23:42 <serb> mkhutornenko: interesting storage proposal on the mailing list! I only wondered one thing...
> > 23:42 <serb> It feels kind of weird that we use H2 as a non-replicated database and build some scaffolding around it in order to distribute its state via the Mesos replicated log.
> > 23:42 <serb> Have you looked into H2 to see whether it would be possible to replace/subclass its in-process transaction log with a replicated Mesos one?
> > 23:43 <serb> Then we would not need the logic that performs simultaneous inserts into the log and the task store, as the backend would handle that by itself.
> > 23:44 <serb> (I know close to nothing about the storage layer, so that's my perspective from 10,000 feet.)
> >
> > 00:22 <wfarner> serb: that crossed my mind as well. I have only drilled in a bit, would love to drill in more.
> >
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Wednesday, March 2, 2016 18:18
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] DB snapshotting
> >
> > Thanks, Bill! Filed https://issues.apache.org/jira/browse/AURORA-1627
> > to track it.
> >
> > On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner <wfar...@apache.org> wrote:
> >> Thanks for the detailed write-up and real-world details! I generally
> >> support momentum towards a single task store implementation, so +1
> >> on dealing with that.
> >>
> >> I anticipated there would be a performance win from straight-to-SQL
> >> snapshots, so I am +1 on that as well.
> >>
> >> In summary, +1 on all fronts!
> >>
> >> On Monday, February 29, 2016, Maxim Khutornenko <ma...@apache.org> wrote:
> >>
> >>> (Apologies for the wordy problem statement, but I feel it is really
> >>> necessary to justify the proposal.)
> >>>
> >>> Over the past two weeks we have been battling a nasty scheduler issue
> >>> in production: the scheduler suddenly stops responding to any user
> >>> requests and subsequently gets killed by our health monitoring. Upon
> >>> restart, a leader may only function for a few seconds and almost
> >>> immediately hangs again.
> >>>
> >>> The long and painful investigation pointed towards internal H2 table
> >>> lock contention that resulted in massive db-write starvation and a
> >>> state where a scheduler write lock would *never* be released. This was
> >>> relatively easy to replicate in Vagrant by creating a large update
> >>> (~4K instances) with a large batch_size (~1K) while bombarding the
> >>> scheduler with getJobUpdateDetails() requests for that job. The
> >>> scheduler would enter a locked-up state on the very first write op
> >>> following the update creation (e.g. a status update for an instance
> >>> transition from the first batch) and stay in that state for minutes,
> >>> until all getJobUpdateDetails() requests were served.
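(For reference: the same starvation pattern can be reproduced outside the scheduler with a stand-alone JDBC sketch against H2's PageStore. The class, table name, and thread counts below are hypothetical, not the Aurora schema or the actual Vagrant repro.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public final class PageStoreStarvation {
      // MV_STORE=FALSE selects the PageStore backend the scheduler uses today;
      // flipping it to TRUE selects MVStore, discussed below.
      private static final String URL =
          "jdbc:h2:mem:repro;DB_CLOSE_DELAY=-1;MV_STORE=FALSE;LOCK_TIMEOUT=60000";

      public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          s.execute("CREATE TABLE tasks (id INT PRIMARY KEY, status VARCHAR)");
          s.execute("INSERT INTO tasks VALUES (1, 'PENDING')");
        }

        // Readers continuously take the shared table lock, standing in for
        // the flood of getJobUpdateDetails() calls.
        ExecutorService readers = Executors.newFixedThreadPool(64);
        for (int i = 0; i < 64; i++) {
          readers.submit(() -> {
            try (Connection c = DriverManager.getConnection(URL)) {
              while (!Thread.currentThread().isInterrupted()) {
                try (Statement s = c.createStatement();
                     ResultSet rs = s.executeQuery("SELECT * FROM tasks")) {
                  while (rs.next()) { /* simulate a slow read */ }
                }
              }
            } catch (SQLException e) {
              // Connection torn down on shutdown; ignore.
            }
          });
        }

        // The single writer needs the exclusive table lock. PageStore hands a
        // released lock to a *random* waiter, so this write can wait a while.
        long start = System.nanoTime();
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          s.execute("UPDATE tasks SET status = 'RUNNING' WHERE id = 1");
        }
        System.out.printf("write waited %d ms%n",
            TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

        readers.shutdownNow();
      }
    }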
> >>> This behavior is well explained by the following sentence from [1]:
> >>>
> >>> "When a lock is released, and multiple connections are waiting for
> >>> it, one of them is picked at random."
> >>>
> >>> What happens here is that when many more read requests are competing
> >>> for the shared table lock, the H2 PageStore does nothing to help
> >>> write requests that require the exclusive table lock succeed. This
> >>> leads to db-write starvation and, eventually, to scheduler native
> >>> store write starvation, as there is no timeout on a scheduler write
> >>> lock.
> >>>
> >>> We played with the various available H2/MyBatis configuration
> >>> settings to mitigate the above, with no noticeable impact. That is,
> >>> until we switched to the H2 MVStore [2], at which point we were able
> >>> to completely eliminate the scheduler lockup without making any other
> >>> code changes! So, has the solution finally been found? The answer
> >>> would be YES, until you try MVStore-enabled H2 on scheduler restart
> >>> with a production DB of any reasonable size. There was a reason we
> >>> disabled MVStore in the scheduler [3] in the first place, and that
> >>> reason was poor MVStore performance with bulk inserts. Re-populating
> >>> an MVStore-enabled H2 DB took at least 2.5 times longer than normal.
> >>> That is unacceptable in prod, where every second of scheduler
> >>> downtime counts.
> >>>
> >>> Back to the drawing board, we tried all relevant settings and
> >>> approaches to speed up MVStore inserts on restart, but nothing really
> >>> helped. Finally, the only reasonable way forward was to eliminate the
> >>> point of slowness altogether, namely to remove the thrift-to-SQL
> >>> migration on restart. Fortunately, H2 supports an easy-to-operate
> >>> command that generates the entire DB dump with a single statement
> >>> [4]. We were now able to bypass the lengthy DB repopulation on
> >>> restart by storing the entire DB dump in the snapshot and replaying
> >>> it on scheduler restart.
> >>>
> >>> Now, the proposal. Given that MVStore vastly outperforms the
> >>> PageStore we currently use, I suggest we move our H2 to it AND adopt
> >>> db snapshotting instead of thrift snapshotting to speed up scheduler
> >>> restarts. A rough POC is available here [5]. We have been running a
> >>> version of this build in production since last week and were able to
> >>> completely eliminate scheduler lockups. As a welcome side effect, we
> >>> also observed faster scheduler restart times, thanks to eliminating
> >>> the thrift-to-SQL chattiness. Depending on snapshot freshness, the
> >>> observed failover downtimes were reduced by ~40%.
> >>>
> >>> Moving to db snapshotting will require us to rethink DB schema
> >>> versioning and the thrift deprecation/removal policy. We will have to
> >>> move to pre-/post-snapshot-restore SQL migration scripts to handle
> >>> any schema changes, which is a common industry pattern but something
> >>> we have not tried yet. The upside, though, is that we can get an
> >>> early start here, as we will have to adopt strict SQL migration rules
> >>> anyway when we move to persistent DB storage. Also, given that
> >>> migrating to an H2 TaskStore will likely further degrade scheduler
> >>> restart times, having a better-performing DB snapshotting solution in
> >>> place will definitely aid that migration.
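(For reference: the dump-and-replay mechanism rests on H2's SCRIPT command [4], which returns the full schema-plus-data dump as a result set when no target file is given. Below is a minimal sketch of the idea only; the POC [5] is the authoritative version, and it stores the dump in the snapshot rather than in memory. The class, table, and ALTER statement are hypothetical.)

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    final class DbSnapshotSketch {
      // Snapshot time: capture the entire DB (schema + data) as SQL
      // statements via H2's SCRIPT command [4].
      static List<String> dump(Connection db) throws SQLException {
        List<String> script = new ArrayList<>();
        try (Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SCRIPT")) {
          while (rs.next()) {
            script.add(rs.getString(1));
          }
        }
        return script;
      }

      // Restart time: replay the captured dump in one shot, bypassing the
      // row-by-row thrift-to-SQL repopulation that made MVStore restarts
      // slow. H2 accepts multiple ';'-terminated statements per execute().
      static void restore(Connection db, List<String> script)
          throws SQLException {
        try (Statement stmt = db.createStatement()) {
          stmt.execute(String.join("\n", script));
        }
      }

      // Schema changes would then ride along as pre-/post-restore migration
      // steps (hypothetical example: a column added after the dump was taken).
      static void restoreWithMigrations(Connection db, List<String> script)
          throws SQLException {
        try (Statement stmt = db.createStatement()) {
          // Pre-restore: adjust anything the old dump's DDL would trip over.
          stmt.execute(String.join("\n", script));
          // Post-restore: bring the restored schema up to the current version.
          stmt.execute("ALTER TABLE tasks ADD COLUMN IF NOT EXISTS tier VARCHAR");
        }
      }
    }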
> >>>
> >>> Thanks,
> >>> Maxim
> >>>
> >>> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
> >>> [2] - http://www.h2database.com/html/mvstore.html
> >>> [3] - https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
> >>> [4] - http://www.h2database.com/html/grammar.html#script
> >>> [5] - https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370