We also used to have timeout issues on a smallish, VM-based test cluster. We ended up doing the following:

* run Zookeeper and Aurora with the concurrent mark-and-sweep (CMS) collector, without any further tuning
* doubled the Aurora native log read and write timeouts. This is in line with the default 10 sec timeout used by Mesos.
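For reference, the changes looked roughly like the following. Exact values are illustrative, and how the JVM options get wired in (e.g. ZooKeeper's conf/java.env, your scheduler init script) depends on your packaging, so treat this as a sketch rather than a recipe:

    # JVM options for ZooKeeper and the Aurora scheduler: enable CMS
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

    # Aurora scheduler flags: doubled native log read/write timeouts
    -native_log_read_timeout=10secs
    -native_log_write_timeout=10secs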
Regards,
Stephan

________________________________________
From: Bhuvan Arumugam <[email protected]>
Sent: Tuesday, June 2, 2015 7:51 PM
To: [email protected]
Subject: Re: aurora replica log snapshot interval

On Tue, Jun 2, 2015 at 10:25 AM, Maxim Khutornenko <[email protected]> wrote:
> Hi Bhuvan,
>
> We have never had to change the native_log_write timeout from its
> default value but we have definitely seen problems with scheduler
> failovers related to snapshotting. It is indeed an IO-intensive
> operation that may and will block all other activities, especially
> when overlapped with a backup creation. During snapshot creation an
> exclusive write lock is held, making all other mutation operations
> impossible. Reads may still be served, though.

Thank you, Maxim! Useful info indeed! We'll refrain from changing the
snapshot interval.

> I would suggest a more thorough investigation to make sure it was
> truly a native_log_write timeout that caused your failover.

Yes, we confirmed it's due to the write timeout, in this case:

Caused by: java.util.concurrent.TimeoutException: Timed out while attempting to append

> Identifying the root cause is crucial here as we have seen two major
> causes for failovers: excessive GC activity leading to ZK timeouts and
> slow disk IO blocking writes in the underlying native log storage.
> Below are a few leads:
>
> Excessive GC:
> - consider using snapshot de-duplication [1] if you are not already
> using it. This has helped us significantly reduce GC activity and
> stored snapshot size.

Interesting. Right now we haven't enabled snapshot de-dup. We'll enable it.

> - consider finely tuning your GC perf. It's not an easy task but there
> are plenty of online resources to help (e.g. [2]).
>
> Excessive IO:
> - consider changing your underlying system IO scheduler. By just
> switching from cfq to deadline we have virtually eliminated our
> failovers due to excessive IO. See AURORA-1211 for details.

Sure. We are using the cfq I/O scheduler on our scheduler hosts. We'll
investigate whether changing to deadline improves the situation.
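From a quick look, the switch itself is just a sysfs write. Roughly as
follows (sda is an example device name, the list of available schedulers
varies by kernel, and you'd need the elevator=deadline boot parameter or
config management to make it persist across reboots):

    $ cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]
    $ echo deadline | sudo tee /sys/block/sda/queue/scheduler
    deadline
    $ cat /sys/block/sda/queue/scheduler
    noop deadline [deadline]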
>
> [1] - https://github.com/apache/aurora/blob/master/docs/scheduler-storage.md
> [2] - http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/
>
> On Tue, Jun 2, 2015 at 9:33 AM, Bhuvan Arumugam <[email protected]> wrote:
>> Hello,
>>
>> In a 300-node cluster with 5 schedulers in the quorum, the replica log
>> writes fail due to timeout (native_log_write_timeout: 3secs),
>> especially when 50+ tasks are flapping. The next leader takes around
>> 2 mins+ to complete the log replay and become active. The service is
>> inaccessible to users, as Aurora isn't yet listening on the port, and
>> users face 503 errors. Why? No snapshot was taken during the last few
>> hours because the crash happened within the configured snapshot
>> interval (default: 1 hour).
>>
>> We bumped the log write timeout and, in parallel, are investigating
>> the reason for the timeout, e.g. whether it's due to bad hardware. In
>> the meantime, we want to reduce service disruption to users by
>> bringing down the replay time. I'd like to know:
>>
>> a) is reducing the snapshot interval (dlog_snapshot_interval) to
>> 30 mins the right thing to do?
>> b) is the snapshot event I/O intensive?
>> c) it takes 0-6 seconds to snapshot 10k events since the last
>> snapshot; does the scheduler block user requests while a snapshot is
>> in progress?
>>
>> Thank you,
>> --
>> Regards,
>> Bhuvan Arumugam
>> www.livecipher.com

--
Regards,
Bhuvan Arumugam
www.livecipher.com
