[ https://issues.apache.org/jira/browse/CASSANDRA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Capriolo reopened CASSANDRA-1776:
----------------------------------------
I may have explained this poorly. On two occasions, about 20 minutes after I
see this in the logs, Cassandra on this node is at 100% user+system CPU on all
cores. The entire cluster then degrades quickly: many messages are pending in
the Gossip stage, and every node is at 100% CPU on all cores. The only course
of action is to bring down the entire cluster or, if you catch the problem
early enough, to bring down multiple nodes at a time.
> Untrapped exceptions in ThreadPool have a variety of ill effects
> ----------------------------------------------------------------
>
> Key: CASSANDRA-1776
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1776
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.5
> Reporter: Edward Capriolo
> Attachments: logs
>
>
> I have seen a variety of conditions that keep the Cassandra process running
> even though it has mostly failed. At times the node stays up sending gossip
> messages, so other nodes think the node is up. In the worst case, a node
> gets into a tight loop, fully utilizing 16 cores of a system and sending
> gossip messages that cause cascading issues across the cluster.
> I have seen untrapped OOM errors. The interesting part of the attached log
> is that we are not using super columns. I also have machines that come up out
> of a 40-second garbage collection, send gossip messages (I assume they gossip
> themselves as UP), then go back into a garbage collection, and repeat.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.