We also used to have timeout issues on a smallish, VM-based test cluster. We ended up doing the following:

* run Zookeeper and Aurora with the concurrent mark-and-sweep (CMS) collector, without any further tuning
* doubled the Aurora native log read and write timeouts. This is in line with the default 10 sec timeout used by Mesos.
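For reference, the changes looked roughly like the following. Exact values are illustrative, and how the JVM options get wired in (e.g. ZooKeeper's conf/java.env, your scheduler init script) depends on your packaging, so treat this as a sketch rather than a recipe:

    # JVM options for ZooKeeper and the Aurora scheduler: enable CMS
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

    # Aurora scheduler flags: doubled native log read/write timeouts
    -native_log_read_timeout=10secs
    -native_log_write_timeout=10secs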
Regards,
Stephan

________________________________________
From: Bhuvan Arumugam <[email protected]>
Sent: Tuesday, June 2, 2015 7:51 PM
To: [email protected]
Subject: Re: aurora replica log snapshot interval

On Tue, Jun 2, 2015 at 10:25 AM, Maxim Khutornenko <[email protected]> wrote:
> Hi Bhuvan,
>
> We have never had to change the native_log_write timeout from its
> default value but we have definitely seen problems with scheduler
> failovers related to snapshotting. It is indeed an IO-intensive
> operation that may and will block all other activities, especially
> when overlapped with a backup creation. During snapshot creation an
> exclusive write lock is held, making all other mutation operations
> impossible. Reads may still be served, though.

Thank you, Maxim! Useful info indeed! We'll refrain from changing the
snapshot interval.

> I would suggest a more thorough investigation to make sure it was
> truly a native_log_write timeout that caused your failover.

Yes, we confirmed it's due to the write timeout, in this case:

Caused by: java.util.concurrent.TimeoutException: Timed out while attempting to append

> Identifying the root cause is crucial here as we have seen two major
> causes for failovers: excessive GC activity leading to ZK timeouts and
> slow disk IO blocking writes in the underlying native log storage.
> Below are a few leads:
>
> Excessive GC:
> - consider using snapshot de-duplication [1] if you are not already
> using it. This has helped us significantly reduce GC activity and
> stored snapshot size.

Interesting. Right now we haven't enabled snapshot de-dup. We'll enable it.

> - consider finely tuning your GC perf. It's not an easy task but there
> are plenty of online resources to help (e.g. [2]).
>
> Excessive IO:
> - consider changing your underlying system IO scheduler. By just
> switching from cfq to deadline we have virtually eliminated our
> failovers due to excessive IO. See AURORA-1211 for details.

Sure. We are using the cfq I/O scheduler on our scheduler hosts. We'll
investigate whether changing to deadline improves the situation.
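From a quick look, the switch itself is just a sysfs write. Roughly as
follows (sda is an example device name, the list of available schedulers
varies by kernel, and you'd need the elevator=deadline boot parameter or
config management to make it persist across reboots):

    $ cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]
    $ echo deadline | sudo tee /sys/block/sda/queue/scheduler
    deadline
    $ cat /sys/block/sda/queue/scheduler
    noop deadline [deadline]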
>
> [1] - https://github.com/apache/aurora/blob/master/docs/scheduler-storage.md
> [2] - http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/
>
> On Tue, Jun 2, 2015 at 9:33 AM, Bhuvan Arumugam <[email protected]> wrote:
>> Hello,
>>
>> In a 300-node cluster with 5 schedulers in the quorum, the replica log
>> writes fail due to timeout (native_log_write_timeout: 3secs),
>> especially when 50+ tasks are flapping. The next leader takes around
>> 2 mins+ to complete the log replay and become active. The service is
>> inaccessible to users, as Aurora isn't yet listening on the port, and
>> users face 503 errors. Why? No snapshot was taken during the last few
>> hours because the crash happened within the configured snapshot
>> interval (default: 1 hour).
>>
>> We bumped the log write timeout and, in parallel, are investigating
>> the reason for the timeout, e.g. whether it's due to bad hardware. In
>> the meantime, we want to reduce service disruption to users by
>> bringing down the replay time. I'd like to know:
>>
>> a) is reducing the snapshot interval (dlog_snapshot_interval) to
>> 30 mins the right thing to do?
>> b) is the snapshot event I/O intensive?
>> c) it takes 0-6 seconds to snapshot 10k events since the last
>> snapshot; does the scheduler block user requests while a snapshot is
>> in progress?
>>
>> Thank you,
>> --
>> Regards,
>> Bhuvan Arumugam
>> www.livecipher.com

--
Regards,
Bhuvan Arumugam
www.livecipher.com
