[ 
https://issues.apache.org/jira/browse/AURORA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755820#comment-15755820
 ] 

David McLaughlin commented on AURORA-1861:
------------------------------------------

[~zmanji] asked me to update this ticket with the results we had in production. 
I've attached the snapshot times from one of our production clusters with this 
scheduler flag set (i.e. only removing updates):

{code}
-snapshot_hydrate_stores="locks,hosts,quota,tasks,crons,scheduler_metadata"
{code}

As the attached graph shows, total snapshot time dropped from around 60s to 
25-30s. Given that we snapshot every hour and the snapshotter holds the storage 
write lock for the entire time, this is a huge win for us. 
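
For anyone who wants to picture what the flag does, here is a rough sketch of 
the idea, not the actual SnapshotStoreImpl code. The class, the method names 
and the "job_updates" store name are all made up for illustration; only the 
comma-separated flag value comes from the setting above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names exist in Aurora. */
class HydrateSketch {
  /** Parses a comma-separated flag value into a set of store names. */
  static Set<String> parseHydrateStores(String flagValue) {
    Set<String> names = new LinkedHashSet<>();
    for (String name : flagValue.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  /**
   * Copies only the selected stores into the snapshot. Stores that are not
   * selected are never read, so their (possibly slow) fetch is skipped.
   */
  static Map<String, Object> saveToSnapshot(
      Map<String, Supplier<Object>> stores, Set<String> hydrate) {
    Map<String, Object> snapshot = new LinkedHashMap<>();
    for (Map.Entry<String, Supplier<Object>> entry : stores.entrySet()) {
      if (hydrate.contains(entry.getKey())) {
        snapshot.put(entry.getKey(), entry.getValue().get());
      }
    }
    return snapshot;
  }

  public static void main(String[] args) {
    Set<String> hydrate = parseHydrateStores(
        "locks,hosts,quota,tasks,crons,scheduler_metadata");

    Map<String, Supplier<Object>> stores = new LinkedHashMap<>();
    stores.put("locks", () -> "lock data");
    // "job_updates" is an assumed name for the store left out of the flag.
    stores.put("job_updates", () -> "expensive job update details");

    // Only "locks" is copied; the job update supplier is never invoked.
    System.out.println(saveToSnapshot(stores, hydrate).keySet());
  }
}
```

The point is only that a store left out of the flag is never read during 
snapshot creation, which is where the time savings would come from.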

I also attached the snapshot creation graph to show that almost all of the 
reduction came from snapshot creation time. Where the reduction actually comes 
from might be surprising. The third graph shows this; it is a graph of 
mybatis.org.apache.aurora.scheduler.storage.db.JobUpdateDetailsMapper.selectAllDetails_nanos_total.
Note that the timings here are INSIDE the write lock, so they are purely the 
time to retrieve and deserialize the data. 

The graph has only partial data because we only recently submitted a patch to 
add metrics around the MyBatis mappers. The metric stops reporting once the 
flag is set, because this query is only ever called from SnapshotStoreImpl. 
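
To make the metric concrete, this is roughly the kind of timing wrapper that 
produces a "_nanos_total" counter. It is a minimal sketch, not the patch we 
submitted: TimedCalls and its methods are hypothetical, and the real metrics 
hook into MyBatis rather than wrapping calls by hand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical sketch of per-call timing counters; not Aurora code. */
class TimedCalls {
  private final Map<String, AtomicLong> nanosTotal = new ConcurrentHashMap<>();
  private final Map<String, AtomicLong> events = new ConcurrentHashMap<>();

  /** Runs the call and adds its elapsed time to the named counters. */
  <T> T time(String name, Supplier<T> call) {
    long start = System.nanoTime();
    try {
      return call.get();
    } finally {
      // Recorded even if the call throws, so failures are still counted.
      long elapsed = System.nanoTime() - start;
      nanosTotal.computeIfAbsent(name, k -> new AtomicLong()).addAndGet(elapsed);
      events.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
    }
  }

  /** Cumulative nanoseconds spent in calls with this name. */
  long nanosTotal(String name) {
    AtomicLong total = nanosTotal.get(name);
    return total == null ? 0L : total.get();
  }

  /** Number of calls recorded under this name. */
  long events(String name) {
    AtomicLong count = events.get(name);
    return count == null ? 0L : count.get();
  }

  public static void main(String[] args) {
    TimedCalls timed = new TimedCalls();
    // Stand-in for a mapper query; the real query hits H2 through MyBatis.
    String result = timed.time("selectAllDetails", () -> "rows");
    System.out.println(result + " calls=" + timed.events("selectAllDetails")
        + " nanos_total=" + timed.nanosTotal("selectAllDetails"));
  }
}
```

A counter like this only moves when the query runs, which is why the graph 
goes flat once nothing calls selectAllDetails any more.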

We're seeing pretty poor performance across the board for H2/MyBatis. Expect 
some follow-up work on this. 

> Remove duplicate Snapshot fields for DB stores
> ----------------------------------------------
>
>                 Key: AURORA-1861
>                 URL: https://issues.apache.org/jira/browse/AURORA-1861
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Assignee: David McLaughlin
>         Attachments: select-all-job-update-details time.png, 
> snapshot-create-time-only.png, snapshot-total-time.png
>
>
> Currently we double-write any DB-backed stores into a Snapshot struct when 
> creating a Snapshot. This inflates the size of the Snapshot, which is already 
> a problem for large production clusters (see AURORA-74). 
> Example for LockStore from 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java:
> {code}
>       new SnapshotField() {
>         // It's important for locks to be replayed first, since there are relations that expect
>         // references to be valid on insertion.
>         @Override
>         public void saveToSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>           snapshot.setLocks(ILock.toBuildersSet(store.getLockStore().fetchLocks()));
>         }
>         @Override
>         public void restoreFromSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>           if (hasDbSnapshot(snapshot)) {
>             LOG.info("Deferring lock restore to dbsnapshot");
>             return;
>           }
>           store.getLockStore().deleteLocks();
>           if (snapshot.isSetLocks()) {
>             for (Lock lock : snapshot.getLocks()) {
>               store.getLockStore().saveLock(ILock.build(lock));
>             }
>           }
>         }
>       },
> {code}
> The saveToSnapshot here is totally redundant as the entire H2 database is 
> dumped into the dbScript field. 
> Note: one major side-effect here is if anyone is trying to read these 
> snapshots and utilize the data outside of Java - they'll lose the ability to 
> process the data without being able to apply the DB script. 


