Thanks for the pointers regarding the broken documentation. I will fix that.

The configuration options have moved and are now described here 
http://aurora.apache.org/documentation/latest/operations/configuration/#replicated-log-configuration


On 03/10/16 09:05, "meghdoot bhattacharya" <meghdoo...@yahoo.com.INVALID> wrote:

    Zameer, thanks for sharing this. For folks who are looking to operate 
Aurora with HA this is very valuable. Operational insights from aurora experts 
is always welcome.Not to hijack the conversation on the 3 questions you asked, 
I found inhttp://aurora.apache.org/documentation/latest/operations/storage/
    
    the links to "Replicated log configuration" and "Backup configuration" 
broken. Last reference I can find in 0.12 
releasehttp://aurora.apache.org/documentation/0.12.0/storage-config/
    
    If the operational documents can be reviewed and enhanced further that 
would be helpful. Highlighting stats like 'scheduler_log_snapshots' or the load 
time for example can be alert points to operators.
    Thx
          From: Zameer Manji <zma...@apache.org>
     To: dev@aurora.apache.org; u...@aurora.apache.org 
     Sent: Friday, September 30, 2016 5:34 PM
     Subject: A mini postmortem on snapshot failures
       
    Aurora Developers and Users,
    
    I would like to share failure case I experienced recently. In a modestly
    sized production cluster with high read load, snapshot creation began to
    fail. The logs showed the following:
    
    ````
    W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
    ### Error rolling back transaction.  Cause: java.sql.SQLException: Error
    accessing PooledConnection. Connection is invalid.
    ### Cause: java.sql.SQLException: Error accessing PooledConnection.
    Connection is invalid. org.apache.ibatis.exceptions.PersistenceException:
    ### Error rolling back transaction.  Cause: java.sql.SQLException: Error
    accessing PooledConnection. Connection is invalid.
    ### Cause: java.sql.SQLException: Error accessing PooledConnection.
    Connection is invalid.
    at
    
org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
    at
    
org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
    at
    
org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
    at
    
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
    at
    
org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
    at
    
org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
    at
    org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
    at
    
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at
    
org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
    at
    
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
    at
    
org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
    at
    
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at
    
org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
    at
    
org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at
    
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at
    
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.sql.SQLException: Error accessing PooledConnection.
    Connection is invalid.
    at
    
org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
    at
    
org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
    at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
    at
    
org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
    at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
    at
    
org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
    at
    
org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
    ... 19 common frames omitted
    ````
    
    This failure is silent and can be observed only through the
    `scheduler_log_snapshots` metric, if it isn't increasing then snapshots are
    not being created. In this cluster, a snapshot was not taken for about 4
    days.
    For those unfamiliar with Aurora's replicated log storage system, snapshot
    creation is important because it allows us to truncate the number of
    entries in the replicated log to a single large entry. This is required
    because the log recovery time is proportional to the number of entries in
    the log. Operators can observe the amount of time it takes to recover the
    log at startup via the `scheduler_log_recover_nanos_total` metric.
    
    The largest value observed for `scheduler_log_recover_nanos_total` during
    this period was 8 minutes. This means that recovery from a failover would
    take at least 8 minutes. For reference, a system aiming to have 99.99%
    uptime can only sustain 4 minutes of downtime a month.
    
    The root cause of this can be found from the exception in the above stack
    trace:
    `Caused by: java.sql.SQLException: Error accessing PooledConnection.
    Connection is invalid.`
    This originates from the MyBatis connection pool used to communicate with
    the in memory SQL store. To create a snapshot, we run a `SCRIPT` query to
    dump the entire database into the replicated log [1].
    
    This exception is being thrown because we have a connection pool to
    communicate with the H2 SQL database. By default the connection pool has
    the following properties:
    * Maximum 10 active connections
    * Maximum connection time of 20s before being considered for eviction.
    
    Under high read load, there can be many pending SQL queries for a
    connection. If a single connection takes more than 20s it will likely be
    evicted. In this case running one of the `SCRIPT` queries was taking more
    than 20s and there were many pending queries which caused MyBatis to evict
    the connection for the `SCRIPT` query, causing snapshot creation failure.
    
    To fix this issue, operators used the `-db_max_active_connection_count` to
    increase the maximum number of active connections for MyBatis to 100. Once
    the scheduler was able to serve requests, operators used `aurora_admin
    scheduler_snapshot` to force create a snapshot. Then a scheduler failover
    was induced and it was observed that recovery time dropped to about 40
    seconds.
    
    Today this cluster continues running with this flag and value to ensure it
    can continue to serve a high read load.
    
    I would like to raise three questions:
    * Should we add a flag to tune the maximum connection time for MyBatis?
    * Should a Snapshot creation failure be fatal?
    * Should we change the default maximum connection time and maximum number
    of active connections?
    
    [1]:
    
https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L107-L127
    
    --
    Zameer Manji
    
    
       

Reply via email to