On 2017-07-14 11:23 (-0700), "Harika Vangapelli -T (hvangape - AKRAYA INC at Cisco)" <hvang...@cisco.com> wrote:
> We are using Cassandra 3.x version..
>

Which 3.x version? 3.11.0? 3.0.14? 3.7? Exact version is important.

> Recently, our production database is going through some instability issues.
> One of our node is keep going down from every 2 days up to a few of times a
> day. The node is down due to JVM out of memory. According to my
> investigation, I suspect that this might be related to the writing and/or
> running compaction of the large partitions for some of our large data tables.
> Here's might be what had happened
> 1. The node went OOM due to unable to de-serialize or compacting some large
> partitions under some condition due to memory constrains.
> 2. Once we re-started it, which was usually a few hours later, the other
> nodes in the cluster were trying to perform the hinted handoff to the down
> node to patch the missing data. From now on, the down node would have to
> handle handoff plus the normal data load, which made it even busier.
> 3. The node was not able to complete the handoff and went down again.
> 4. This went again and again.
>

Sounds like it's always the same node? You may want to try running 'nodetool scrub' on that node and watching logs for errors that may indicate a corrupt file on disk, which would cause the behavior you're seeing.
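
If it helps with the "watching logs" part, here is a minimal Python sketch that pulls suspicious lines out of a log file. The markers it searches for (the ERROR level and the CorruptSSTableException class name) and the log path in the usage note are assumptions on my side, not something taken from your cluster, and the helper names are purely illustrative. Adjust them to whatever your logs actually contain.

```python
import re

# Assumed markers: the ERROR log level and the CorruptSSTableException
# class name. Extend or replace these for your own logs.
SUSPECT = re.compile(r"\bERROR\b|CorruptSSTableException")


def suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching SUSPECT."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


def scan_log(path):
    """Scan a single log file; 'path' is wherever your install writes its log."""
    with open(path, errors="replace") as handle:
        return suspect_lines(handle)
```

Something like scan_log("/var/log/cassandra/system.log") would be the typical call, assuming that is where your install writes its log. Run it after the scrub and compare the file names in any matching lines against the SSTables on the node that keeps going down.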