Do you happen to be using a tool like Nagios or Ganglia that can report
utilization (CPU, load, disk I/O, network)? There are plugins for both
that will also notify you about what is happening (depending on whether
you enabled the intermediate GC logging).
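
If GC logging is on, a rough sketch like the one below can pull the long
pauses out of a GC log. It assumes the JVM was started with
-XX:+PrintGCApplicationStoppedTime, so that the log contains HotSpot's
"Total time for which application threads were stopped" lines. The
function name long_pauses and the 1-second default threshold are only
illustrative; they are not part of Cassandra or of any monitoring plugin.

```python
import re

# Line printed by HotSpot when -XX:+PrintGCApplicationStoppedTime is set
# (assumption: that flag is enabled on the node being inspected).
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+\.[0-9]+) seconds"
)


def long_pauses(lines, threshold_s=1.0):
    """Return (line_number, pause_seconds) for pauses >= threshold_s.

    `lines` is any iterable of log lines, e.g. an open file object.
    The threshold is a hypothetical starting point, not a recommendation.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue  # not a "threads were stopped" line
        pause = float(match.group(1))
        if pause >= threshold_s:
            hits.append((lineno, pause))
    return hits
```

To use it, open the GC log of a node that was reported as DN and pass the
file object to long_pauses; pauses of several seconds around the time the
node was marked down would point towards the heavy-load explanation.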



On Thu, Apr 2, 2015 at 8:35 AM, Jan <cne...@yahoo.com> wrote:

> Marcin,
>
> Are all your nodes within the same region?
> If not, which snitch type are you using?
>
> Jan/
>
>
>
>   On Thursday, April 2, 2015 3:28 AM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
>
> Hey Marcin,
>
> Are they actually going up and down repeatedly (flapping) or just down and
> they never come back?
> There might be different reasons for flapping nodes, but to list what I
> have off the top of my head right now:
>
> 1. Network issues. I don't think it's your case, but you can read about
> the issues some people are having when deploying C* on AWS EC2 (keyword to
> look for: phi_convict_threshold)
>
> 2. Heavy load. The node is under heavy load because of a massive number of
> reads / writes / bulkloads or e.g. unthrottled compaction etc., which may
> result in extensive GC.
>
> Could any of these be a problem in your case? I'd start by investigating
> the GC logs, e.g. to see how long the "stop the world" full GC takes (GC
> logs should be on by default from what I can see [1])
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
>
> Michał
>
>
> Kind regards,
> Michał Michalski,
> michal.michal...@boxever.com
>
> On 2 April 2015 at 11:05, Marcin Pietraszek <mpietras...@opera.com> wrote:
>
> Hi!
>
> We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
> installed. Assume we have nodes A, B, C, D, E. On an irregular basis
> one of those nodes starts to report that a subset of the other nodes is
> in DN state, although the C* daemon is running on all nodes:
>
> A$ nodetool status
> UN B
> DN C
> DN D
> UN E
>
> B$ nodetool status
> UN A
> UN C
> UN D
> UN E
>
> C$ nodetool status
> DN A
> UN B
> UN D
> UN E
>
> After a restart of node A, C and D report that A is in UN, and A also
> claims that the whole cluster is in UN state. Right now I don't have any
> clear steps to reproduce that situation. Do you guys have any idea
> what could be causing such behaviour? How could this be prevented?
>
> It seems that when node A is the coordinator and gets a request for
> data replicated on C and D, it responds with an Unavailable
> exception; after restarting A the problem disappears.
>
> --
> mp
>
>
>
>
>
