Re: Cluster status instability

2015-04-08 Thread Erik Forsberg
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems
to stay that way for a very long time (hours). I'm not even sure it will
recover without a restart.
* I've tried to stop and then start gossip with nodetool on the node that
thinks several other nodes are down. That did not help.
* nodetool gossipinfo, when run on an affected node, claims STATUS:NORMAL for
all nodes (including the ones marked as down in the status output).
* It is quite possible that the problem starts at the time of day when we
have a lot of bulkloading going on. But why does it then stay for several
hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12
about a month ago, but I have no hard data to back that up.

Regarding region/snitch - this is not an AWS deployment; we run in our own
datacenter with GossipingPropertyFileSnitch.
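
(For completeness, the cassandra-rackdc.properties on each node is just the
usual two lines for that snitch; the dc/rack values below mirror what
gossipinfo reports for the 2:5 node and of course differ per rack:

    dc=iceland
    rack=rack2
)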

Right now I have this situation with one node (04-05) thinking that there
are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all
nodes are up. Load on the cluster right now is minimal, and there's no GC
going on. Heap usage is approximately 3.5/6 GB.
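
(For the record, I'm spot-checking heap and GC activity with something like
this - just a sketch, run on the node itself as the user owning the C* process,
with jstat from the running JDK on the path:

    jstat -gcutil $(pgrep -f CassandraDaemon) 1000 5

i.e. heap occupancy and GC counts once a second, five samples.)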

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256  1.8%  114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256  1.8%  b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256  1.6%  4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256  1.8%  95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3
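
For completeness, this is roughly how I'm comparing each node's view of the
cluster (just a sketch, assuming remote JMX is reachable; cluster-hosts.txt
stands in for our actual host list):

    for h in $(cat cluster-hosts.txt); do
      echo "== $h =="
      nodetool -h "$h" status | grep '^DN' || echo "(no nodes reported down)"
    done

Only 04-05 produces any DN lines at the moment.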

Excerpt from nodetool gossipinfo showing one node that status thinks is
down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that would make
04-05 mark it as up, with no success.
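
For reference, the sequence was roughly this (a sketch; cssa02-05 is simply my
shorthand for the 2:5 box, following the same naming scheme as cssa04-05):

    nodetool -h cssa02-05 disablegossip
    sleep 30
    nodetool -h cssa02-05 enablegossip
    # then re-check from the node with the stale view:
    nodetool -h cssa04-05 status | grep DN

04-05 still listed 2:5 as DN afterwards.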

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle daeme...@gmail.com
wrote:

 Do you happen to be using a tool like Nagios or Ganglia that is able to
 report utilization (CPU, load, disk I/O, network)? There are plugins for
 both that will also notify you (depending on whether you have enabled the
 intermediate GC logging) about what is happening.



 On Thu, Apr 2, 2015 at 8:35 AM, Jan cne...@yahoo.com wrote:

 Marcin;

 are all your nodes within the same Region?
 If not in the same region, what is the Snitch type that you are using?

 Jan/



   On Thursday, April 2, 2015 3:28 AM, Michal Michalski 
 michal.michal...@boxever.com wrote:


 Hey Marcin,

 Are they actually going up and down repeatedly (flapping), or do they go
 down and never come back?
 There might be different reasons for flapping nodes, but to list what I
 have at the top of my head right now:

 1. Network issues. I don't think it's your case, but you can read about
 the issues some people are having when deploying C* on AWS EC2 (keyword to
 look for: phi_convict_threshold)

 2. Heavy load. A node is under heavy load because of a massive number of
 reads / writes / bulkloads or e.g. unthrottled compaction etc., which may
 result in extensive GC.

 Could any of these be a problem in your case? I'd start by investigating
 the GC logs, e.g. to see how long the stop-the-world full GC takes (GC logs
 should be on by default, from what I can see [1]).

 [1] https://issues.apache.org/jira/browse/CASSANDRA-5319

 Michał


 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com
 wrote:

 Hi!

 We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
 installed. Assume we have nodes A, B, C, D, E. On some irregular basis,
 one of those nodes starts to report that a subset of the other nodes is in
 DN state, although the C* daemon on all nodes is running:

 A$ nodetool status
 UN B
 DN C
 DN D
 UN E

 B$ nodetool status
 UN A
 UN C
 UN D
 UN E

 C$ nodetool status
 DN A
 UN B
 UN D
 UN E

 After a restart of node A, C and D report that A is in UN, and A also
 claims that the whole cluster is in UN state. Right now I don't have any
 clear steps to reproduce that situation; do you guys have any idea
 what could be causing such behaviour? How could this be prevented?

Re: Cluster status instability

2015-04-02 Thread Michal Michalski
Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go down
and never come back?
There might be different reasons for flapping nodes, but to list what I
have at the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the
issues some people are having when deploying C* on AWS EC2 (keyword to look
for: phi_convict_threshold)

2. Heavy load. A node is under heavy load because of a massive number of reads
/ writes / bulkloads or e.g. unthrottled compaction etc., which may result
in extensive GC.
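
On point 1: if it did turn out to be failure-detector related, the knob is
phi_convict_threshold in cassandra.yaml; the value below is purely illustrative
(the default is 8, and people on EC2 sometimes raise it a little):

    phi_convict_threshold: 12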

Could any of these be a problem in your case? I'd start by investigating the
GC logs, e.g. to see how long the stop-the-world full GC takes (GC logs should
be on by default, from what I can see [1]).

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319
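
If they're not enabled on your install, the relevant JVM options normally live
in conf/cassandra-env.sh and look roughly like this (an illustrative set, not
copied from any particular version):

    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

The last flag is the interesting one for this problem, since it shows how long
application threads were actually paused.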

Michał


Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com wrote:

 Hi!

 We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
 installed. Assume we have nodes A, B, C, D, E. On some irregular basis,
 one of those nodes starts to report that a subset of the other nodes is in
 DN state, although the C* daemon on all nodes is running:

 A$ nodetool status
 UN B
 DN C
 DN D
 UN E

 B$ nodetool status
 UN A
 UN C
 UN D
 UN E

 C$ nodetool status
 DN A
 UN B
 UN D
 UN E

 After a restart of node A, C and D report that A is in UN, and A also
 claims that the whole cluster is in UN state. Right now I don't have any
 clear steps to reproduce that situation; do you guys have any idea
 what could be causing such behaviour? How could this be prevented?

 It seems that when node A is the coordinator and gets a request for some
 data replicated on C and D, it responds with an Unavailable
 exception; after restarting A, that problem disappears.

 --
 mp



Re: Cluster status instability

2015-04-02 Thread Jan
Marcin;

are all your nodes within the same Region?
If not in the same region, what is the Snitch type that you are using?

Jan/


 On Thursday, April 2, 2015 3:28 AM, Michal Michalski 
michal.michal...@boxever.com wrote:
   

Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go down
and never come back?
There might be different reasons for flapping nodes, but to list what I have
at the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the
issues some people are having when deploying C* on AWS EC2 (keyword to look
for: phi_convict_threshold)

2. Heavy load. A node is under heavy load because of a massive number of reads
/ writes / bulkloads or e.g. unthrottled compaction etc., which may result in
extensive GC.

Could any of these be a problem in your case? I'd start by investigating the
GC logs, e.g. to see how long the stop-the-world full GC takes (GC logs should
be on by default, from what I can see [1]).

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319

Michał

Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com wrote:

Hi!

We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
installed. Assume we have nodes A, B, C, D, E. On some irregular basis,
one of those nodes starts to report that a subset of the other nodes is in
DN state, although the C* daemon on all nodes is running:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After a restart of node A, C and D report that A is in UN, and A also
claims that the whole cluster is in UN state. Right now I don't have any
clear steps to reproduce that situation; do you guys have any idea
what could be causing such behaviour? How could this be prevented?

It seems that when node A is the coordinator and gets a request for some
data replicated on C and D, it responds with an Unavailable
exception; after restarting A, that problem disappears.

--
mp




  

Re: Cluster status instability

2015-04-02 Thread daemeon reiydelle
Do you happen to be using a tool like Nagios or Ganglia that is able to
report utilization (CPU, load, disk I/O, network)? There are plugins for
both that will also notify you (depending on whether you have enabled the
intermediate GC logging) about what is happening.
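
If neither is wired up, a few manual spot checks on the affected node will
tell you most of the same story (a rough sketch; iostat and sar come from the
sysstat package):

    uptime                # load average
    iostat -x 5 3         # per-device disk utilization
    sar -n DEV 5 3        # per-interface network throughput
    nodetool tpstats      # dropped messages / backed-up thread pools in C*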



On Thu, Apr 2, 2015 at 8:35 AM, Jan cne...@yahoo.com wrote:

 Marcin;

 are all your nodes within the same Region?
 If not in the same region, what is the Snitch type that you are using?

 Jan/



   On Thursday, April 2, 2015 3:28 AM, Michal Michalski 
 michal.michal...@boxever.com wrote:


 Hey Marcin,

 Are they actually going up and down repeatedly (flapping), or do they go down
 and never come back?
 There might be different reasons for flapping nodes, but to list what I
 have at the top of my head right now:

 1. Network issues. I don't think it's your case, but you can read about
 the issues some people are having when deploying C* on AWS EC2 (keyword to
 look for: phi_convict_threshold)

 2. Heavy load. A node is under heavy load because of a massive number of reads
 / writes / bulkloads or e.g. unthrottled compaction etc., which may result
 in extensive GC.

 Could any of these be a problem in your case? I'd start by investigating the
 GC logs, e.g. to see how long the stop-the-world full GC takes (GC logs
 should be on by default, from what I can see [1]).

 [1] https://issues.apache.org/jira/browse/CASSANDRA-5319

 Michał


 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com wrote:

 Hi!

 We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
 installed. Assume we have nodes A, B, C, D, E. On some irregular basis,
 one of those nodes starts to report that a subset of the other nodes is in
 DN state, although the C* daemon on all nodes is running:

 A$ nodetool status
 UN B
 DN C
 DN D
 UN E

 B$ nodetool status
 UN A
 UN C
 UN D
 UN E

 C$ nodetool status
 DN A
 UN B
 UN D
 UN E

 After a restart of node A, C and D report that A is in UN, and A also
 claims that the whole cluster is in UN state. Right now I don't have any
 clear steps to reproduce that situation; do you guys have any idea
 what could be causing such behaviour? How could this be prevented?

 It seems that when node A is the coordinator and gets a request for some
 data replicated on C and D, it responds with an Unavailable
 exception; after restarting A, that problem disappears.

 --
 mp