Erik, the scenario you're describing is almost identical to what we've been
experiencing. Sounds like you've been pulling your hair out too! You're also
running the same distro and kernel as us. And we also run without swap.
Which raises the question... what version of libc6 are you running? Here's
the output from one of our upgraded boxes:

$ dpkg --list | grep libc6
ii  libc6                                    2.11.1-0ubuntu7.7
  Embedded GNU C Library: Shared libraries
ii  libc6-dev                                2.11.1-0ubuntu7.7
  Embedded GNU C Library: Development Librarie

Before upgrading, the version field showed 2.11.1-0ubuntu7.5. I'm wondering
what yours is.

We also found ourselves in a similar situation across different regions.
We're using the canonical Ubuntu AMI as the base for our systems, but there
appear to be small differences between the packages included in the AMIs
from different regions, and libc6 seems to be one of the packages that
differs. I discovered this by diffing the `dpkg --list` output from a good
node against a bad one.

The architecture hypothesis is also very interesting. If we could reproduce
the bug with the latest libc6 build, I'd escalate it back up to Amazon. But I
can't repro it, so there's nothing to escalate.

For what it's worth, we were able to reproduce the lockup behavior that
you're describing by running a tight loop that spawns threads. Here's a gist
of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd be
interested to know whether that locks things up on your system with a new
libc6.
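
The gist has the exact code, but the core idea is roughly the following
(a minimal sketch, not the actual app from the gist; the class name and
counter are just for illustration): spin in a loop that creates and joins
short-lived threads as fast as possible.

// Rough sketch of a thread-churn reproducer (the gist is the real
// reference). It just creates and tears down native threads as quickly
// as it can and reports progress so stalls are easy to spot.
public class ThreadSpawnLoop {
    public static void main(String[] args) throws InterruptedException {
        long spawned = 0;
        while (true) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    // Do nothing; we only care about the cost of
                    // creating and tearing down native threads.
                }
            });
            t.start();
            t.join();
            spawned++;
            if (spawned % 10000 == 0) {
                System.out.println("spawned " + spawned + " threads");
            }
        }
    }
}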

Mike

On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen <eon...@gmail.com> wrote:

> May or may not be related but I thought I'd recount a similar experience we
> had in EC2 in hopes it helps someone else.
>
> As background, we had been running several servers in a 0.6.8 ring with no
> Cassandra issues (some EC2 issues, but none related to Cassandra) on
> multiple EC2 XL instances in a single availability zone. We decided to add
> several other nodes to a second AZ for reasons beyond the scope of this
> email. As we reached steady operational state in the new AZ, we noticed that
> the new nodes in the new AZ were repeatedly getting dropped from the ring.
> At first we attributed the drops to the phi failure detector and the
> expected cross-AZ latency. As we
> tried to pinpoint the issue, we found something very similar to what you
> describe - the EC2 VMs in the new AZ would become completely unresponsive.
> Not just the Java process hosting Cassandra, but the entire host. Shell
> commands would not execute for existing sessions, we could not establish new
> SSH sessions and tails we had on active files wouldn't show any progress. It
> appeared as if the machines in the new AZ would seize for several minutes,
> then come back to life with little rhyme or reason as to why. Tickets opened
> with AMZN resulted in responses of "the physical server looks normal".
>
> After digging deeper, here's what we found. To confirm, all nodes in both
> AZs were identical at the following levels:
> * Kernel (2.6.32-305-ec2 #9-Ubuntu SMP), distro (Ubuntu 10.04.1 LTS), and
> glibc, all on x86_64
> * All nodes were running identical Java distributions that we deployed
> ourselves: Sun 1.6.0_22-b04
> * Same amount of virtualized RAM visible to the guest, same RAID stripe
> configuration across the same size/number of ephemeral drives
>
> We noticed two things that were different across the VMs in the two AZs:
> * The class of CPU exposed to the guest OSes across the two AZs (and
> presumably of the physical server underneath each guest).
> ** On hosts in the AZ not having issues, the guest sees older
> Harpertown-class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz"
> ** On hosts in the AZ having issues, the guest sees newer
> Nehalem-class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz"
> * Percent steal was consistently higher on the new nodes: on average 25%,
> whereas the older (stable) VMs were around 9% at peak load
>
> Consistently in our case, we only saw this seizing behavior on guests
> running on the newer Nehalem architecture CPUs.
>
> In digging a bit deeper on the problem machines, we also noticed the
> following:
> * Most of the time, ParNew GC on the problematic hosts was fine, averaging
> around .04 "real" seconds. After spending time tuning the generations and
> heap size for our workload, we rarely have CMS collections and almost never
> have Full GCs, even during full or anti-compactions.
> * Rarely, and at the same time as the problematic machines would seize, a
> long running ParNew collection would be recorded after the guest came back
> to life. Consistently this was between 180 and 220 seconds regardless of
> host, plenty of time for that host to be shunned from the ring.
>
> The long ParNew GCs were a mystery. They *never* happened on the hosts in
> the other AZ (the Harpertown class) and rarely happened on the new guests
> but we did observe the behavior within three hours of normal operation on
> each host in the new AZ.
>
> After lots of trial and error, we decided to take ParNew collections out of
> the equation and tried running a host in the new AZ with "-XX:-UseParNewGC",
> which eliminated the long ParNew problem. The flip side is that we now do
> serial collections on the young generation for half our ring, which means
> those nodes spend about 4x more time in GC than the other nodes, but
> they've been stable for two weeks since the change.
>
> That's what we know for sure and we're back to operating without a hitch
> with the one JVM option change.
>
> <editorial>
> What I think is happening is more complicated. Skip this part if you don't
> care about opinion; some of this reasoning is surely incorrect. From
> talking with multiple VMware experts (I don't have much experience with
> Xen, but I imagine the same is true there as well), it's generally a bad
> idea to virtualize too many cores (two seems to be the sweet spot). The
> reason is that if you have a heavily multithreaded application that relies
> on consistent application of memory barriers across multiple cores (as Java
> does), the hypervisor has to wait for multiple physical cores to become
> available before it schedules the guest, so that each virtual core gets a
> consistent view of virtual memory while scheduled. If the physical
> server is overcommitted, that wait time is exacerbated as the guest waits
> for the correct number of physical cores to become available (4 in our
> case). You can see this in VMware via esxtop; I'm not sure about Xen. It
> would also be somewhat visible as %steal increases in the guest, which we
> saw, but that doesn't really explain a two-minute pause during garbage
> collection. My guess, then, is that one or more of the following are at
> play in this scenario:
>
> 1) a core Nehalem bug - the Nehalem architecture made a lot of changes to
> the way it manages TLBs, largely as a virtualization optimization. I doubt
> this is the case, but assuming the guest isn't being shown a different
> architecture than the one it's actually running on, we did see this issue
> only on E5507 processors.
> 2) the physical servers in the new AZ are drastically overcommitted -
>  maybe AMZN bought into the notion that Nehalems are better at
> virtualization and is allowing more guests to run on physical servers
> running Nehalems. I've no idea, just a hypothesis.
> 3) a hypervisor bug - I've deployed large JVMs to big physical Nehalem
> boxen running Cassandra clusters under high load and never seen behavior
> like the above. If I could see more of what the hypervisor was doing I'd
> have a pretty good idea here, but such is life in the cloud.
>
> </editorial>
>
> I should also say that I don't think any issues we had were at all related
> specifically to Cassandra. We were running fine in the first AZ, with no
> problems other than needing to grow capacity. Only when we saw the different
> architecture in the new EC2 AZ did we experience problems, and when we
> shackled the new-generation collector, the bad problems went away.
>
> Sorry for the long tirade. This was originally going to be a blog post, but
> I thought it would have more value in context here. I hope it ultimately
> helps someone else.
> -erik
>
>
> On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone <m...@simplegeo.com> wrote:
>
>> Hey folks,
>>
>> We've discovered an issue on Ubuntu Lucid with libc6 2.11.1-0ubuntu7.5 (it
>> may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4).
>> The bug manifests when a large number of threads (or processes) are
>> created rapidly. Once triggered, the system becomes completely
>> unresponsive for ten to fifteen minutes. We've seen this on our
>> production Cassandra clusters under high load. Cassandra seems particularly
>> susceptible because of the large thread pools that it creates.
>> In particular, we suspect the unbounded thread pool for connection
>> management may be pushing some systems over the edge.
>>
>> We're still trying to narrow down what changed in libc that is causing
>> this issue. We also haven't tested things outside of Xen or on non-x86
>> architectures. But if you're seeing these symptoms, you may want to try
>> upgrading libc6.
>>
>> I'll send out an update if we find anything else interesting. If anyone
>> has any thoughts as to what the cause is, we're all ears!
>>
>> Hope this saves someone some heartache,
>>
>> Mike
>>
>
>
