[
https://issues.apache.org/jira/browse/AURORA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755820#comment-15755820
]
David McLaughlin commented on AURORA-1861:
------------------------------------------
[~zmanji] asked me to update this ticket with the results we had in production.
I've attached the snapshot times from one of our production clusters with this
scheduler flag set (i.e. only removing updates):
{code}
-snapshot_hydrate_stores="locks,hosts,quota,tasks,crons,scheduler_metadata"
{code}
As you can see, total snapshot time dropped from around 60s to 25-30s. Given
that we snapshot every hour and the snapshotter holds the storage write lock
for the entire time, this is a huge win for us.
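For anyone unfamiliar with the flag, here is a minimal sketch of the idea
behind it. This is not the scheduler's actual code: the class and method names
are made up for illustration, and "job_updates" is an assumed name for the
store that the flag value above leaves out. It only shows how a comma-separated
list of store names could decide which stores still get copied into the
Snapshot struct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these types are Aurora classes.
class HydrateFilterSketch {

  // Parses a comma-separated flag value into an ordered set of store names,
  // ignoring surrounding whitespace and empty entries.
  static Set<String> parseHydrateStores(String flagValue) {
    Set<String> names = new LinkedHashSet<>();
    for (String part : flagValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  // Returns the stores that would still be copied into the Snapshot struct,
  // preserving the order of allStores.
  static List<String> storesToHydrate(List<String> allStores, Set<String> hydrate) {
    List<String> selected = new ArrayList<>();
    for (String store : allStores) {
      if (hydrate.contains(store)) {
        selected.add(store);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // "job_updates" is an assumed name for the store omitted from the flag.
    List<String> allStores = Arrays.asList(
        "locks", "hosts", "quota", "tasks", "crons", "job_updates", "scheduler_metadata");
    Set<String> hydrate =
        parseHydrateStores("locks,hosts,quota,tasks,crons,scheduler_metadata");
    System.out.println(storesToHydrate(allStores, hydrate));
  }
}
```

Any store missing from the list is simply skipped when the struct is populated,
so it is restored only from the DB script.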
I also attached the snapshot creation graph to show that almost all of the
reduction came from snapshot creation time. Where the reduction actually comes
from might be surprising. The third graph shows this - it is a graph for
mybatis.org.apache.aurora.scheduler.storage.db.JobUpdateDetailsMapper.selectAllDetails_nanos_total.
Note that the timings here are INSIDE the write lock, so they are purely the
time taken to retrieve and deserialize the data.
The graph only has partial data because we only recently submitted a patch to
add metrics around the MyBatis mappers, and the metric has now disappeared
because this query is only ever called from SnapshotStoreImpl.
We're seeing pretty poor performance across the board for H2/MyBatis. Expect
some follow-up work on this.
> Remove duplicate Snapshot fields for DB stores
> ----------------------------------------------
>
> Key: AURORA-1861
> URL: https://issues.apache.org/jira/browse/AURORA-1861
> Project: Aurora
> Issue Type: Task
> Components: Scheduler
> Reporter: David McLaughlin
> Assignee: David McLaughlin
> Attachments: select-all-job-update-details time.png,
> snapshot-create-time-only.png, snapshot-total-time.png
>
>
> Currently we double-write any DB-backed stores into a Snapshot struct when
> creating a Snapshot. This inflates the size of the Snapshot, which is already
> a problem for large production clusters (see AURORA-74).
> Example for LockStore from
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java:
> {code}
> new SnapshotField() {
>   // It's important for locks to be replayed first, since there are relations that expect
>   // references to be valid on insertion.
>   @Override
>   public void saveToSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>     snapshot.setLocks(ILock.toBuildersSet(store.getLockStore().fetchLocks()));
>   }
>
>   @Override
>   public void restoreFromSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>     if (hasDbSnapshot(snapshot)) {
>       LOG.info("Deferring lock restore to dbsnapshot");
>       return;
>     }
>
>     store.getLockStore().deleteLocks();
>
>     if (snapshot.isSetLocks()) {
>       for (Lock lock : snapshot.getLocks()) {
>         store.getLockStore().saveLock(ILock.build(lock));
>       }
>     }
>   }
> },
> {code}
> The saveToSnapshot here is totally redundant as the entire H2 database is
> dumped into the dbScript field.
> Note: one major side effect is that anyone reading these snapshots and using
> the data outside of Java will no longer be able to process it unless they can
> apply the DB script.