Re: Cassandra Node keep going down

2017-07-17 Thread Jeff Jirsa


On 2017-07-14 11:23 (-0700), "Harika Vangapelli -T (hvangape - AKRAYA INC at Cisco)" wrote:
> We are using Cassandra 3.x.
> 

Which 3.x version? 3.11.0? 3.0.14? 3.7? Exact version is important. 
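
One way to collect the exact version is sketched below in Python. It assumes `nodetool` is on the PATH of the affected node and that `nodetool version` prints a line of the form `ReleaseVersion: x.y.z`; the function names are hypothetical and only for illustration.

```python
import subprocess
from typing import Optional


def parse_release_version(output: str) -> Optional[str]:
    """Extract the version from text shaped like 'ReleaseVersion: 3.0.14'.

    The 'ReleaseVersion:' prefix is an assumption about nodetool's output.
    """
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "ReleaseVersion":
            return value.strip()
    return None


def node_release_version() -> Optional[str]:
    """Run 'nodetool version' on the local node and parse its output."""
    result = subprocess.run(
        ["nodetool", "version"],
        capture_output=True,
        text=True,
        check=True,  # raise if nodetool exits with an error
    )
    return parse_release_version(result.stdout)
```

Calling `node_release_version()` on each node also shows whether the failing node runs the same release as the rest of the cluster.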

> Recently, our production database has been unstable. One of our nodes keeps 
> going down, anywhere from once every 2 days to several times a day. The node 
> goes down because the JVM runs out of memory. Based on my investigation, I 
> suspect this is related to writing and/or compacting large partitions in 
> some of our large tables. Here is what might have happened:
> 1. The node went OOM because, under memory constraints, it was unable to 
> deserialize or compact some large partitions.
> 2. Once we restarted it, usually a few hours later, the other nodes in the 
> cluster tried to perform hinted handoff to the restarted node to supply the 
> missing data. From then on, that node had to handle the handoff on top of 
> its normal load, which made it even busier.
> 3. The node was not able to complete the handoff and went down again.
> 4. This cycle repeated again and again.
> 

Sounds like it's always the same node? You may want to try running 'nodetool 
scrub' on that node and watching logs for errors that may indicate a corrupt 
file on disk, which would cause the behavior you're seeing.
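
As a rough aid for the log watching suggested above, here is a minimal Python sketch that flags suspicious log lines. The patterns are assumptions, not an authoritative list of Cassandra log messages: `CorruptSSTableException` and `OutOfMemoryError` are exception names, while the "large partition" wording may differ between versions. The `SUSPECT_PATTERNS` and `scan_log` names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns; adjust them to what your Cassandra version actually logs.
SUSPECT_PATTERNS = {
    "corruption": re.compile(r"CorruptSSTableException|corrupt", re.IGNORECASE),
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "large_partition": re.compile(r"large partition", re.IGNORECASE),
}


def scan_log(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
                break  # report each line once, under the first matching label
    return hits
```

To use it, open the node's system log (often `/var/log/cassandra/system.log` on package installs, though the location depends on your setup), pass the file object to `scan_log`, and print the hits. Running it after 'nodetool scrub' finishes makes it easier to see whether the same SSTable shows up repeatedly.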





Cassandra Node keep going down

2017-07-14 Thread Harika Vangapelli -T (hvangape - AKRAYA INC at Cisco)
We are using Cassandra 3.x.

Recently, our production database has been unstable. One of our nodes keeps 
going down, anywhere from once every 2 days to several times a day. The node 
goes down because the JVM runs out of memory. Based on my investigation, I 
suspect this is related to writing and/or compacting large partitions in some 
of our large tables. Here is what might have happened:
1. The node went OOM because, under memory constraints, it was unable to 
deserialize or compact some large partitions.
2. Once we restarted it, usually a few hours later, the other nodes in the 
cluster tried to perform hinted handoff to the restarted node to supply the 
missing data. From then on, that node had to handle the handoff on top of its 
normal load, which made it even busier.
3. The node was not able to complete the handoff and went down again.
4. This cycle repeated again and again.

This is not the first time we have seen this issue. Last time, we fixed it by 
manually stopping some of our aggregation jobs for a whole night so the node 
could complete the handoff. We are not sure about the root cause yet, and we 
cannot explain why this happens to only one node. I investigated the issue and 
found two related Cassandra JIRAs:
https://issues.apache.org/jira/browse/CASSANDRA-8269 and
https://issues.apache.org/jira/browse/CASSANDRA-8723

Both JIRAs mention that this might only be the case with Cassandra 2.x.

Thanks,

Harika

