May or may not be related, but I thought I'd recount a similar experience we
had in EC2 in hopes it helps someone else.

As background, we had been running several servers in a 0.6.8 ring with no
Cassandra issues (some EC2 issues, but none related to Cassandra) on
multiple EC2 XL instances in a single availability zone. We decided to add
several more nodes in a second AZ for reasons beyond the scope of this
email. As we reached a steady operational state in the new AZ, we noticed
that the new nodes were repeatedly getting dropped from the ring.
At first we attributed the drops to the phi failure-detector threshold and
the expected cross-AZ latency. As we tried to pinpoint the issue, we found
something very similar to what you describe - the EC2 VMs in the new AZ
would become completely unresponsive. Not just the Java process hosting
Cassandra, but the entire host. Shell commands would not execute in
existing sessions, we could not establish new SSH sessions, and tails we
had on active files showed no progress. It
appeared as if the machines in the new AZ would seize for several minutes,
then come back to life with little rhyme or reason as to why. Tickets opened
with AMZN resulted in responses of "the physical server looks normal".

After digging deeper, here's what we found. We confirmed that all nodes in
both AZs were identical at the following levels (a quick per-node check you
can diff across machines is sketched after this list):
* Kernel (2.6.32-305-ec2 #9-Ubuntu SMP), distro (Ubuntu 10.04.1 LTS), and
glibc on x86_64
* All nodes were running identical Java distributions that we deployed
ourselves, Sun 1.6.0_22-b04
* Same amount of virtualized RAM visible to the guest, same RAID stripe
configuration across the same size/number of ephemeral drives
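
For what it's worth, here's the kind of thing you can run on each node and
diff the output of - a rough Python sketch using standard Linux/Java
commands, illustrative rather than a script we actually shipped:

    import platform, subprocess

    def run(cmd):
        # "java -version" writes to stderr, so merge both streams.
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT,
                             universal_newlines=True)
        return p.communicate()[0].strip()

    print(platform.uname())                           # kernel + arch
    print(run(["lsb_release", "-d"]))                 # distro
    print(run(["ldd", "--version"]).splitlines()[0])  # glibc version
    print(run(["java", "-version"]))                  # JVM build
    print(open("/proc/meminfo").readline().strip())   # RAM visible to guest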

We noticed two things that were different across the VMs in the two AZs:
* The class of CPU exposed to the guest OSes across the two AZs (and
presumably the class of physical server underneath each guest).
** On hosts in the AZ not having issues, the guest sees older Harpertown
class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz"
** On hosts in the AZ having issues, the guest sees newer Nehalem class
Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz"
* Percent steal was consistently higher on the new nodes, on average 25%,
whereas the older (stable) VMs were around 9% at peak load (a quick way to
check both of these from inside a guest is sketched below)
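
Here's a rough Python sketch of that check, assuming the usual "model name"
line in /proc/cpuinfo and the standard /proc/stat layout where steal is the
8th counter on the "cpu" line - illustrative only, not the exact tooling we
used:

    import time

    def cpu_model():
        # First "model name" entry from /proc/cpuinfo
        for line in open("/proc/cpuinfo"):
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()

    def steal_pct(interval=5):
        # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
        def snap():
            fields = [int(x) for x in open("/proc/stat").readline().split()[1:]]
            return fields[7], sum(fields)
        s1, t1 = snap()
        time.sleep(interval)
        s2, t2 = snap()
        return 100.0 * (s2 - s1) / (t2 - t1)

    print(cpu_model())
    print("steal: %.1f%%" % steal_pct())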

Consistently in our case, we only saw this seizing behavior on guests
running on the newer Nehalem architecture CPUs.

Digging a bit deeper into the problem machines, we also noticed the
following:
* Most of the time, ParNew GC on the problematic hosts was fine, averaging
around 0.04 "real" seconds. After spending time tuning the generations and
heap size for our workload, we rarely have CMS collections and almost never
have Full GCs, even during full or anti-compactions.
* Rarely, and at the same time as the problematic machines would seize, a
long-running ParNew collection would be recorded after the guest came back
to life. Consistently this was between 180 and 220 seconds regardless of
host, plenty of time for that host to be shunned from the ring (a sketch
for pulling these outliers out of a GC log follows this list).
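
If you want to hunt for the same outliers in your own GC logs, something
along these lines will do it - a rough sketch assuming -XX:+PrintGCDetails
style output where each ParNew entry carries a "[Times: user=... sys=...,
real=N.NN secs]" tail (adjust the regex to your log format):

    import re, sys

    # Flag ParNew collections whose wall-clock ("real") time exceeds the threshold.
    THRESHOLD_SECS = 10.0
    pattern = re.compile(r"ParNew.*real=([\d.]+) secs")

    for line in sys.stdin:
        m = pattern.search(line)
        if m and float(m.group(1)) > THRESHOLD_SECS:
            print(line.rstrip())

Feed it a log, e.g. "python long_parnew.py < gc.log" (the filename is just
an example); anything in the 180-220 second range is the kind of event that
got our hosts shunned.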

The long ParNew GCs were a mystery. They *never* happened on the hosts in
the other AZ (the Harpertown class) and rarely happened on the new guests,
but we did observe the behavior within three hours of normal operation on
each host in the new AZ.

After lots of trial and error, we decided to remove ParNew collections from
the equation and tried running a host in the new AZ with "-XX:-UseParNewGC",
which eliminated the long ParNew problem. The flip side is that we now do
serial collections on the young generation for half our ring, which means
those nodes spend about 4x more time in GC than the other nodes, but
they've been stable for two weeks since the change.

That's what we know for sure, and we're back to operating without a hitch
with the one JVM option change.

<editorial>
What I think is happening is more complicated. Skip this part if you don't
care about opinion; some of this reasoning is surely incorrect. From
talking with multiple VMware experts (I don't have much experience with
Xen, but I imagine the same is true there as well), I gather it's generally
a bad idea to virtualize too many cores (two seems to be the sweet spot).
The reason is that if you have a heavily multithreaded application that
relies on consistent application of memory barriers across multiple cores
(as Java does), the hypervisor has to wait for multiple physical cores to
become available before it schedules the guest, so that each virtual core
gets a consistent view of virtual memory while scheduled. If the physical
server is overcommitted, that wait time is exacerbated as the guest waits
for the correct number of physical cores to become available (4 in our
case). You can see this in VMware via esxtop; I'm not sure how in Xen. It
would also be somewhat visible as %steal increases in the guest, which we
saw, but that doesn't really explain a two-minute pause during garbage
collection. My guess, then, is that one or more of the following are at
play in this scenario:

1) a core Nehalem bug - the Nehalem architecture made a lot of changes to
the way it manages TLBs, largely as a virtualization optimization. I doubt
this is the case, but assuming the guest isn't being shown a different
architecture than the physical one, we did see this issue only on E5507
processors.
2) the physical servers in the new AZ are drastically overcommitted -  maybe
AMZN bought into the notion that Nehalems are better at virtualization and
is allowing more guests to run on physical servers running Nehalems. I've no
idea, just a hypothesis.
3) a hypervisor bug - I've deployed large JVMs to big physical Nehalem boxen
running Cassandra clusters under high load and never seen behavior like the
above. If I could see more of what the hypervisor was doing I'd have a
pretty good idea here, but such is life in the cloud.

</editorial>

I also should say that I don't think any issues we had were at all related
specifically to Cassandra. We were running fine in the first AZ, no problems
other than needing to grow capacity. Only when we saw the different
architecture in the new EC2 AZ did we experience problems, and when we
shackled the new-generation collector, the bad problems went away.

Sorry for the long tirade. This was originally going to be a blog post, but
I thought it would have more value in context here. I hope ultimately it
helps someone else.
-erik


On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone <m...@simplegeo.com> wrote:

> Hey folks,
>
> We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5 (it
> may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4).
> The bug affects systems when a large number of threads (or processes) are
> created rapidly. Once triggered, the system will become completely
> unresponsive for ten to fifteen minutes. We've seen this issue on our
> production Cassandra clusters under high load. Cassandra seems particularly
> susceptible to this issue because of the large thread pools that it creates.
> In particular, we suspect the unbounded thread pool for connection
> management may be pushing some systems over the edge.
>
> We're still trying to narrow down what changed in libc that is causing this
> issue. We also haven't tested things outside of xen, or on non-x86
> architectures. But if you're seeing these symptoms, you may want to try
> upgrading libc6.
>
> I'll send out an update if we find anything else interesting. If anyone has
> any thoughts as to what the cause is, we're all ears!
>
> Hope this saves someone some heart-ache,
>
> Mike
>
