Re: A mini postmortem on snapshot failures

Erb, Stephan Wed, 04 Jan 2017 07:11:57 -0800

I am not aware of tickets or work on the _exact_ problem described in this 
thread.


However, there is currently some ongoing work to improve scheduler performance
and availability during snapshots on large clusters:

• https://reviews.apache.org/r/54883/  (still pending)
• https://reviews.apache.org/r/54774/ with associated ticket 
https://issues.apache.org/jira/browse/AURORA-1861 (already merged)

Maybe those are also interesting for you.


On 31/12/2016, 00:03, "meghdoot bhattacharya" <meghdoo...@yahoo.com.INVALID> 
wrote:

    Is there a ticket for this? In our large test cluster, we ran into this
    under heavy load.
    Thx
    
          From: John Sirois <jsir...@apache.org>
     To: dev@aurora.apache.org 
     Sent: Wednesday, October 5, 2016 11:04 AM
     Subject: Re: A mini postmortem on snapshot failures
       
    On Tue, Oct 4, 2016 at 11:55 AM, Joshua Cohen <jco...@apache.org> wrote:
    
    > Hi Zameer,
    >
    > Thanks for this writeup!
    >
    > I think one other option to consider would be using a connection for
    > writing the snapshots that's not bound by the pool's maximum checkout
    > time. I'm not sure if this is feasible or not, but I worry that there's
    > potentially no upper bound on raising the maximum checkout time as the
    > size of a cluster grows or its read traffic grows. It feels a bit
    > heavyweight to up the max checkout time when potentially the only
    > connection exceeding the limits is the one writing the snapshot.
    >
    
    +1 to the above, it would be great if the snapshotter thread could grab its
    own dedicated connection.  In fact, since the operation is relatively rare
    (compared to all other reads and writes), it could even grab and dispose of
    a new connection per-snapshot to simplify things (i.e., no need to check for
    a stale connection like you would have to if you grabbed one and tried to
    hold it for the lifetime of the scheduler process).
    
    
    > I'd definitely be in favor of adding a flag to tune the maximum connection
    > checkout.
    >
    > I'm neutral to negative on having Snapshot creation failures be fatal. I
    > don't necessarily think that one failed snapshot should take the scheduler
    > down, but I agree that some number of consecutive failures is a Bad Thing
    > that is worthy of investigation. My concern with having failures be fatal
    > is the pathological case where snapshots always fail, causing your
    > scheduler to fail over once every SNAPSHOT_INTERVAL. Do you think it would
    > be sufficient to add `scheduler_log_snapshots` to the list of important
    > stats[1]?
    >
    > I'm also neutral on changing the defaults. I'm not sure if it's warranted,
    > as the behavior will vary based on cluster. It seems like you guys got bit
    > by this due to a comparatively heavy read load? Our cluster, on the other
    > hand, is probably significantly larger, but is not queried as much, and we
    > haven't run into issues with the defaults. However, as long as there is
    > no adverse impact to bumping the default values I've got no objections.
    >
    > Cheers,
    >
    > Joshua
    >
    > [1]
    > https://github.com/apache/aurora/blob/master/docs/operations/monitoring.md#important-stats
    >
    > On Fri, Sep 30, 2016 at 7:34 PM, Zameer Manji <zma...@apache.org> wrote:
    >
    > > Aurora Developers and Users,
    > >
    > > I would like to share a failure case I experienced recently. In a modestly
    > > sized production cluster with high read load, snapshot creation began to
    > > fail. The logs showed the following:
    > >
    > > ````
    > > W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
    > > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    > > ### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    > > org.apache.ibatis.exceptions.PersistenceException:
    > > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    > > ### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    > > at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
    > > at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
    > > at org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
    > > at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
    > > at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
    > > at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
    > > at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
    > > at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    > > at org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
    > > at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
    > > at org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
    > > at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    > > at org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
    > > at org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
    > > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    > > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    > > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    > > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    > > at java.lang.Thread.run(Thread.java:745)
    > > Caused by: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    > > at org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
    > > at org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
    > > at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
    > > at org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
    > > at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
    > > at org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
    > > at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
    > > ... 19 common frames omitted
    > > ````
    > >
    > > This failure is silent and can be observed only through the
    > > `scheduler_log_snapshots` metric: if it isn't increasing, then
    > > snapshots are not being created. In this cluster, a snapshot was not
    > > taken for about 4 days.
    > >
    > > For those unfamiliar with Aurora's replicated log storage system,
    > > snapshot creation is important because it allows us to truncate the
    > > replicated log down to a single large entry. This is required because
    > > the log recovery time is proportional to the number of entries in the
    > > log. Operators can observe the amount of time it takes to recover the
    > > log at startup via the `scheduler_log_recover_nanos_total` metric.
    > >
    > > The largest value observed for `scheduler_log_recover_nanos_total`
    > > during this period was 8 minutes. This means that recovery from a
    > > failover would take at least 8 minutes. For reference, a system
    > > aiming to have 99.99% uptime can only sustain about 4 minutes of
    > > downtime a month.
    > >
    > > The root cause of this can be found from the exception in the above
    > > stack trace:
    > > `Caused by: java.sql.SQLException: Error accessing PooledConnection.
    > > Connection is invalid.`
    > > This originates from the MyBatis connection pool used to communicate
    > > with the in-memory SQL store. To create a snapshot, we run a `SCRIPT`
    > > query to dump the entire database into the replicated log [1].
    > >
    > > This exception is thrown because of the limits on the connection pool
    > > we use to communicate with the H2 SQL database. By default the
    > > connection pool has the following properties:
    > > * Maximum 10 active connections
    > > * Maximum connection checkout time of 20s before a connection is
    > >   considered for eviction.
    > >
    > > Under high read load, there can be many pending SQL queries waiting
    > > for a connection. If a single connection is checked out for more than
    > > 20s, it will likely be evicted. In this case, running one of the
    > > `SCRIPT` queries was taking more than 20s and there were many pending
    > > queries, which caused MyBatis to evict the connection for the `SCRIPT`
    > > query, causing snapshot creation to fail.
    > >
    > > To fix this issue, operators used the `-db_max_active_connection_count`
    > > flag to increase the maximum number of active connections for MyBatis
    > > to 100. Once the scheduler was able to serve requests, operators used
    > > `aurora_admin scheduler_snapshot` to force the creation of a snapshot.
    > > Then a scheduler failover was induced and it was observed that
    > > recovery time dropped to about 40 seconds.
    > >
    > > Today this cluster continues running with this flag and value to ensure
    > > it can continue to serve a high read load.
    > >
    > > I would like to raise three questions:
    > > * Should we add a flag to tune the maximum connection time for MyBatis?
    > > * Should a Snapshot creation failure be fatal?
    > > * Should we change the default maximum connection time and maximum
    > >   number of active connections?
    > >
    > > [1]:
    > > https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L107-L127
    > >
    > > --
    > > Zameer Manji
    > >
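
Note on the uptime figure in Zameer's message: the arithmetic behind "99.99% uptime allows about 4 minutes of downtime a month" can be checked with a few lines. This is only a sketch; the function name and the 30-day month are assumptions made for illustration and do not come from the thread.

```python
def downtime_budget_minutes(availability: float, days: float = 30.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability.

    Assumes a 30-day month by default; a longer month gives a slightly
    larger budget.
    """
    minutes_in_period = days * 24 * 60
    return minutes_in_period * (1.0 - availability)


budget = downtime_budget_minutes(0.9999)   # roughly 4.32 minutes
observed_recovery_minutes = 8.0            # largest recovery time reported in the thread

# A single failover at the observed recovery time would use up the
# whole monthly budget by itself.
exceeds_budget = observed_recovery_minutes > budget
```

Under these assumptions a single 8-minute recovery costs nearly two months of downtime budget, which is why the stalled snapshots mattered even though the scheduler kept serving requests.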
    >
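
Note on the eviction behaviour described in Zameer's message: the toy model below is a sketch of that description and is not MyBatis code. The `ToyPool` class, its field names, the number of readers, and their arrival at the 25-second mark are all assumptions made for illustration; only the limits of 10 active connections and a 20s checkout time come from the thread. It illustrates why a long-running checkout is only reclaimed once the pool is exhausted, which is consistent with the report that raising the active connection limit to 100 let snapshots succeed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: each connection is distinct
class Connection:
    owner: str
    checked_out_at: float
    valid: bool = True


@dataclass
class ToyPool:
    """Hypothetical pool that reclaims overdue connections under contention."""
    max_active: int = 10
    max_checkout_secs: float = 20.0
    active: List[Connection] = field(default_factory=list)

    def checkout(self, owner: str, now: float) -> Optional[Connection]:
        # Free capacity: hand out a fresh connection.
        if len(self.active) < self.max_active:
            conn = Connection(owner, now)
            self.active.append(conn)
            return conn
        # Pool exhausted: inspect the connection that has been held longest.
        oldest = min(self.active, key=lambda c: c.checked_out_at)
        if now - oldest.checked_out_at > self.max_checkout_secs:
            # Overdue: invalidate it and give its slot to the waiting caller.
            oldest.valid = False
            self.active.remove(oldest)
            conn = Connection(owner, now)
            self.active.append(conn)
            return conn
        return None  # in a real pool the caller would block and retry


def snapshot_survives(max_active: int, readers: int = 20) -> bool:
    """True if the snapshot's connection is still valid after a read burst."""
    pool = ToyPool(max_active=max_active)
    snapshot_conn = pool.checkout("snapshot", now=0.0)
    # Readers arrive at t=25s, while the long SCRIPT-style dump still runs.
    for i in range(readers):
        pool.checkout(f"reader-{i}", now=25.0)
    return snapshot_conn.valid
```

With the default of 10 active connections the tenth reader finds the pool full, sees that the snapshot connection has been held for more than 20 seconds, and takes it over. With 100 active connections the same burst never exhausts the pool, so the snapshot connection is left alone even though it is just as overdue.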

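Note on monitoring `scheduler_log_snapshots`, as Joshua suggests: one way to turn "some number of consecutive failures" into an alert is to sample the counter once per snapshot interval and flag it when it stops increasing. The sketch below is an assumption about how an operator might do this outside the scheduler; the function name, the threshold of 3, and the sampling scheme are hypothetical and are not part of Aurora.

```python
from typing import Iterable, Optional


def snapshots_stalled(samples: Iterable[int], max_stalled: int = 3) -> bool:
    """True if the counter failed to increase for `max_stalled` samples in a row.

    `samples` are successive readings of a snapshot counter, assumed to be
    taken once per snapshot interval. A reading that is lower than the
    previous one (for example after a restart resets the counter) is
    treated as no progress in this sketch.
    """
    stalled = 0
    previous: Optional[int] = None
    for value in samples:
        if previous is not None:
            if value > previous:
                stalled = 0          # progress: a new snapshot was written
            else:
                stalled += 1         # no new snapshot in this interval
                if stalled >= max_stalled:
                    return True
        previous = value
    return False
```

A single missed snapshot does not trip the check, which matches the concern that one failure should not be treated as fatal, while a run of misses like the four-day gap in the original report would be flagged after the third interval.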