Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with a 4GB heap 
(15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). 
We've noticed a brief flurry of query failures during the night coinciding 
with our backup schedule. More specifically, our logs suggest that calling 
"nodetool snapshot" on a node triggers 12 to 16 second CMS collections and a 
promotion failure resulting in a full stop-the-world GC, during which 
the node is marked dead by the ring until it rejoins shortly after.

Here's a log from one of the nodes, along with system info and JVM options: 
https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, calling nodetool flush and then 
nodetool snapshot. (After investigating, we realized that calling both flush and 
snapshot is unnecessary, since snapshot flushes memtables itself, and have since 
updated the script to call only snapshot; a simplified sketch of the job is 
below.) While the memtables are being flushed, we'll generally see a GC logged 
by Cassandra such as:

"GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 
used; max is 4424663040."
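
For reference, here's roughly what the updated nightly job looks like (the 
nodetool path and snapshot name are illustrative, not our exact script):

  #!/bin/sh
  # Simplified sketch of the nightly snapshot cron job.
  # "nodetool snapshot" flushes memtables before snapshotting, so the
  # separate "nodetool flush" call has been dropped.
  NODETOOL=/usr/local/cassandra/bin/nodetool   # illustrative install path
  $NODETOOL -h localhost snapshot "nightly-$(date +%Y%m%d)"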

In the JVM GC logs, we'll often see a tenured promotion failure occurring 
during this collection, resulting in a full stop-the-world GC like this 
(different node):

1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: 
user=8.00 sys=0.10, real=6.46 secs] 
 (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 
3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 
secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring will shun the node, and when the collection 
completes, the node will mark all other hosts in the ring as dead. The node and 
ring stabilize shortly afterward, once they detect each other as up and complete 
hinted handoff (details in the log).

Yesterday we enabled JNA on one of the nodes to avoid forking a subprocess to 
call `ln` during snapshots. We still observed a concurrent mode failure 
following a flush/snapshot, but the CMS collection was shorter (9 
seconds) and did not result in the node being shunned from the ring.
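
For what it's worth, enabling JNA on that node amounted to roughly the 
following (paths are illustrative of a standard 0.7 layout):

  # Put jna.jar on Cassandra's classpath so snapshots can hard-link
  # SSTables in-process instead of forking "ln".
  cp jna.jar /usr/local/cassandra/lib/
  # then restart the node so it picks up the library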

While the query failures that result from this activity are brief, our retry 
threshold for timeout exceptions is set to 6, and we're concerned that we're 
exceeding it. We'd like to figure out why we see long CMS collections and 
promotion failures triggering full GCs during a snapshot.

Has anyone seen this, or have suggestions on how to prevent full GCs from 
occurring during a flush / snapshot?

Thanks,

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com
