Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rahul Reddy
Sorry, no corruption errors.

Thanks, Jeff.

Anything specific to look into if this happens again?

On Fri, Jul 19, 2019, 2:40 PM Nitan Kainth wrote:

> Do you see no corruption errors, or do you see corruption errors?


Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Nitan Kainth
Do you see no corruption errors, or do you see corruption errors?


Regards,
Nitan
Cell: 510 449 9629

> On Jul 19, 2019, at 1:52 PM, Rahul Reddy wrote:
> 
> Schema matches and corruption errors in system.log


Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rahul Reddy
Schema matches and corruption errors in system.log

On Fri, Jul 19, 2019, 1:33 PM Nitan Kainth wrote:

> Do you see the schema in sync? Check with nodetool describecluster.
>
> Check system log for any corruption.


Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Nitan Kainth
Do you see the schema in sync? Check with nodetool describecluster.

Check system log for any corruption.
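A minimal sketch of these two checks. The nodetool and grep invocations are the standard ones; the log path, sample output, and helper function are illustrative only:

```shell
# Live commands, as they would be run on a node (paths illustrative):
#   nodetool describecluster | sed -n '/Schema versions:/,$p'
#   grep -iE 'corrupt' /var/log/cassandra/system.log
#
# Offline helper: count distinct schema versions in describecluster output.
# A healthy cluster reports exactly one.
schema_versions() {
  printf '%s\n' "$1" | grep -cE '^[[:space:]]*[0-9a-f-]{36}:'
}

dc_sample='Cluster Information:
    Name: prod
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2, 10.0.0.3]'

schema_versions "$dc_sample"   # prints 1: a single schema version, i.e. in sync
```

More than one version line in the `Schema versions:` section means the cluster disagrees on schema.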


Regards,
Nitan
Cell: 510 449 9629

> On Jul 19, 2019, at 12:32 PM, ZAIDI, ASAD A wrote:
> 
> “aws asked to set nvme_timeout to higher number in etc/grub.conf.”
>  
> Did you ask AWS if setting a higher value is the real solution to the bug? Is
> there not a patch available to address it? Just curious to know.


Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Jeff Jirsa
Could be something like
https://issues.apache.org/jira/browse/CASSANDRA-14358

Hard to say after the fact.
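If it does happen again, the gossip UP/DOWN transitions in system.log are worth capturing while the cluster is still confused. A hedged sketch; the "is now UP/DOWN" wording is the stock Gossiper log message, while paths, timestamps, and addresses below are placeholders:

```shell
# Live commands (paths are placeholders):
#   nodetool gossipinfo | grep -E '^/|STATUS'
#   grep -E 'is now (UP|DOWN)' /var/log/cassandra/system.log
#
# Offline helper: reduce the log's UP/DOWN churn to the last known state
# recorded for each peer.
last_state() {
  awk '/is now (UP|DOWN)/ {
         for (i = 1; i <= NF; i++) if ($i ~ /^\//) ip = $i
         state[ip] = $NF
       }
       END { for (ip in state) print ip, state[ip] }'
}

log_sample='INFO  16:42:01 InetAddress /10.0.0.2 is now DOWN
INFO  16:50:12 InetAddress /10.0.0.2 is now UP'

printf '%s\n' "$log_sample" | last_state   # prints: /10.0.0.2 UP
```

Comparing each node's last recorded state against what nodetool status shows at the same moment is one way to spot the mismatch described later in the thread.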




RE: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread ZAIDI, ASAD A

https://lore.kernel.org/patchwork/patch/884501/
According to this link, it sounds like a patched kernel is the real solution to
the bug.



From: Rahul Reddy [mailto:rahulreddy1...@gmail.com]
Sent: Friday, July 19, 2019 11:48 AM
To: user@cassandra.apache.org
Subject: Re: Rebooting one Cassandra node caused all the application nodes to go down

Raj,

No, that was not the case. In system.log I see it started listening to client
connections at 16:42, but somehow it was still unreachable until 16:50; the
Grafana dashboard below showed it. Once everything is up in the logs, why would
it still show as down in nodetool status and Grafana?

Zaidi,

In the latest AWS Linux AMI they took care of this bug. But changing the AMI
needs a rebuild of all the nodes, so we didn't take that route.




Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rajsekhar Mallick
Hello Rahul,

As per your description, the Cassandra process is up and running, as you
verified from the logs, but nodetool and Grafana aren't fetching data. This
points to the suspect being JMX port 7199.

Do run and check netstat -anp | egrep "7199|9042|7070" on the impacted host and
the other hosts in the cluster. There has to be some difference. Observe the IP
address to which JMX port 7199 is binding. Is it the same as it was prior to
the reboot?

Thanks
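A sketch of that comparison, scripted. The netstat invocation is the one suggested above; the sample output and helper function are illustrative:

```shell
# Live command, as suggested above:
#   netstat -anp | egrep "7199|9042|7070"
#
# Offline helper: pull out the address a given listening port is bound to,
# so the value can be diffed across hosts (sample mimics netstat -anp output).
bind_addr_for_port() {  # $1 = port, stdin = netstat output
  awk -v p=":$1$" '$6 == "LISTEN" && $4 ~ p { split($4, a, ":"); print a[1] }'
}

netstat_sample='tcp 0 0 127.0.0.1:7199 0.0.0.0:* LISTEN 1234/java
tcp 0 0 10.0.0.5:9042 0.0.0.0:* LISTEN 1234/java'

printf '%s\n' "$netstat_sample" | bind_addr_for_port 7199   # prints 127.0.0.1
```

A JMX port bound to 127.0.0.1 on one host but 0.0.0.0 or the private IP elsewhere would explain remote tooling seeing that node as unreachable.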




RE: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread ZAIDI, ASAD A
“aws asked to set nvme_timeout to higher number in etc/grub.conf.”

Did you ask AWS if setting a higher value is the real solution to the bug? Is
there not a patch available to address it? Just curious to know.
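For reference, the workaround under discussion, raising the NVMe I/O timeout via a kernel parameter, would look roughly like this. The parameter name follows AWS's published EBS/NVMe guidance; the exact file location and regeneration command vary by distro and are assumptions here:

```shell
# In /etc/default/grub on most distros (older images may use /etc/grub.conf
# directly); 4294967295 is the documented maximum:
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.io_timeout=4294967295"
# Then regenerate the grub config and reboot, e.g. on Amazon Linux 2:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
#   sudo reboot
```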




Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rajsekhar Mallick
Hello Rahul,

Basically, the issue is that running nodetool status on the rebooted node shows
itself as UN and all other nodes in the cluster as DN, while running nodetool
status on any other node in the cluster shows the rebooted node as DN.
Correct me if I am wrong: is this the issue?
Also attach a screenshot of the observation you are describing. You may choose
to redact the IP addresses of the hosts.

Thanks
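A small helper to capture that observation from each node, so the per-node views can be compared side by side. The layout follows standard nodetool status output; the addresses and sample are placeholders:

```shell
# Offline helper over `nodetool status` output: list the peers a node reports
# as DN. Run `nodetool status | down_nodes` on each host and diff the results.
down_nodes() {
  awk '$1 == "DN" { print $2 }'
}

status_sample='Datacenter: us-east-1
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB  256     ?     aaaa     1a
DN  10.0.0.2   1.1 GiB  256     ?     bbbb     1b'

printf '%s\n' "$status_sample" | down_nodes   # prints 10.0.0.2
```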

On Fri, 19 Jul, 2019, 9:36 PM Rahul Reddy wrote:

> Thanks for the quick response, Rajsekhar.
>
> Correct, same cassandra.yaml and same Java.


Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rahul Reddy
Thanks for the quick response, Rajsekhar.

Correct, same cassandra.yaml and same Java.



Re: Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rajsekhar Mallick
Hello Rahul,

Could you please confirm the below things:

1. The cassandra.yaml file of the node that was started after the machine
reboot is the same as that of the rest of the nodes in the cluster.
2. The Java version is consistent across all nodes in the cluster.

Do check and revert.

Thanks
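One hedged way to script checks 1 and 2 across the fleet. The ssh loop, host names, and config path are assumptions; the offline helper just compares checksums:

```shell
# Live loop (host list and paths are assumptions for your environment):
#   for h in node1 node2 node3; do
#     ssh "$h" 'md5sum /etc/cassandra/cassandra.yaml; java -version 2>&1 | head -1'
#   done
#
# Offline helper: given "host checksum" lines, say whether all hosts agree.
configs_agree() {
  awk '!($2 in seen) { n++; seen[$2] = 1 }
       END { if (n == 1) print "yes"; else print "no" }'
}

printf '%s\n' 'node1 d41d8cd9' 'node2 d41d8cd9' | configs_agree   # prints yes
printf '%s\n' 'node1 d41d8cd9' 'node2 ffffffff' | configs_agree   # prints no
```

The same comparison works for the java -version strings, or anything else that should be identical across nodes.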

>


Rebooting one Cassandra node caused all the application nodes to go down

2019-07-19 Thread Rahul Reddy
Hi,

We have 6 nodes in each of 2 data centers, us-east-1 and us-west-2. We have RF
3, CL set to LOCAL_QUORUM, and the gossiping snitch. All our instances are
c5.2xlarge, and data files and commit logs are stored on gp2 EBS. The C5
instance type had a bug for which AWS asked us to set nvme_timeout to a higher
number in /etc/grub.conf. After setting the parameter, we ran nodetool drain
and rebooted the node in east.

The instance came up, but Cassandra didn't come up normally; we had to start
Cassandra manually. Cassandra came up, but it showed the other instances as
down. Even though we didn't reboot the other nodes, the same was observed on
one other node. How could that happen? And we don't see any errors in
system.log, which is set to INFO.
Without any intervention, gossip settled in 10 minutes and the entire cluster
became normal.

We tried the same thing in west, and it happened again.

I'm concerned about how to check what caused this and, if a reboot happens
again, how to avoid it.
If I just stop Cassandra instead of rebooting, I don't see this issue.
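For what it's worth, a sketch of the restart sequence described above, with a post-reboot wait. Note that nodetool drain leaves the node in a drained state until the process is restarted, and after a reboot nothing restarts Cassandra unless a service manager does, which matches the observation that it "didn't come up normal". The service name and the poll helper are assumptions:

```shell
# Restart sequence (service name 'cassandra' is an assumption):
#   nodetool drain            # flush memtables; node stops accepting traffic
#   sudo systemctl stop cassandra
#   sudo reboot
#   # after boot, start explicitly unless systemd/init does it for you:
#   sudo systemctl start cassandra
#
# Post-start wait: poll a status command until it reports zero DN peers.
wait_until_all_up() {  # $1 = command printing nodetool-status-style lines
  tries=0
  while [ "$("$1" | grep -c '^DN')" -gt 0 ]; do
    tries=$((tries + 1))
    [ "$tries" -ge 60 ] && return 1   # give up after ~60 polls
    sleep 1
  done
  return 0
}

# Offline check with a stub that reports everything UN:
fake_status() { printf 'UN  10.0.0.1\nUN  10.0.0.2\n'; }
wait_until_all_up fake_status && echo all-up   # prints all-up
```

In the real sequence, `fake_status` would be replaced by a small wrapper around nodetool status; automating the wait avoids declaring a reboot done while gossip is still settling.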