Re: Nodes becoming unresponsive

2020-02-06 Thread Erick Ramirez
>
> I tried to debug further and, using top, could see that the COMMAND column
> shows MutationStage threads. Is there any clue we can get from this?
>

That just means there are lots of writes hitting your cluster. Without a
thread dump, it's difficult to tell whether the threads are blocked in
futex_wait or something else entirely is going on. Cheers!


Re: Nodes becoming unresponsive

2020-02-06 Thread Surbhi Gupta
I have limited options for JDK-based tools because in our environment we are
running a JRE only.

I tried to debug further and, using top, could see that the COMMAND column
shows MutationStage threads. Is there any clue we can get from this?

top - 16:30:47 up 94 days,  5:33,  1 user,  load average: 134.83, 142.48, 144.75
Tasks: 564 total,  58 running, 506 sleeping,   0 stopped,   0 zombie
Cpu(s): 95.2%us,  2.5%sy,  0.3%ni,  1.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  132236016k total, 131378384k used,   857632k free,   189208k buffers
Swap:        0k total,        0k used,        0k free, 94530140k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  WCHAN      COMMAND
 11798 cassandr  20   0  261g  42g  14g R 14.4 33.3   76:47.38 -          MutationStage-1
 11762 cassandr  20   0  261g  42g  14g S 14.1 33.3   82:19.84 -          MutationStage-9
 11530 cassandr  20   0  261g  42g  14g R 13.8 33.3  100:00.22 -          MutationStage-3
 11501 cassandr  20   0  261g  42g  14g S 13.4 33.3    2598:38 -          MutationStage-5
 11688 cassandr  20   0  261g  42g  14g S 13.1 33.3   90:42.47 -          MutationStage-5
 11512 cassandr  20   0  261g  42g  14g R 12.8 33.3  153:13.59 -          MutationStage-1
 11534 cassandr  20   0  261g  42g  14g R 12.8 33.3  104:48.21 -          MutationStage-2
 11708 cassandr  20   0  261g  42g  14g S 12.5 33.3   87:17.64 -          MutationStage-6
 11783 cassandr  20   0  261g  42g  14g S 12.5 33.3   76:01.10 futex_wai  MutationStage-1
 11792 cassandr  20   0  261g  42g  14g S 12.5 33.3   76:19.90 futex_wai  MutationStage-1
 11504 cassandr  20   0  261g  42g  14g S 12.2 33.3  859:10.54 futex_wai  MutationStage-8
 11517 cassandr  20   0  261g  42g  14g R 12.2 33.3  116:18.38 -          MutationStage-2
 11535 cassandr  20   0  261g  42g  14g R 12.2 33.3   96:11.11 -          MutationStage-3
 11710 cassandr  20   0  261g  42g  14g R 12.2 33.3   86:50.77 -          MutationStage-7
 11730 cassandr  20   0  261g  42g  14g S 12.2 33.3   78:36.04 -          MutationStage-1
 11743 cassandr  20   0  261g  42g  14g R 12.2 33.3   80:27.18 -          MutationStage-1
 11773 cassandr  20   0  261g  42g  14g R 12.2 33.3   79:29.48 -          MutationStage-1
 11800 cassandr  20   0  261g  42g  14g S 12.2 33.3   77:01.39 futex_wai  MutationStage-1
 11830 cassandr  20   0  261g  42g  14g R 12.2 33.3   70:47.18 -          MutationStage-1
 11495 cassandr  20   0  261g  42g  14g R 11.8 33.3    7693:04 -          MutationStage-3
 11675 cassandr  20   0  261g  42g  14g R 11.8 33.3   94:13.22 -          MutationStage-4
 11683 cassandr  20   0  261g  42g  14g S 11.8 33.3   91:42.91 futex_wai  MutationStage-4
 11701 cassandr  20   0  261g  42g  14g S 11.8 33.3   85:16.00 -          MutationStage-7
 11703 cassandr  20   0  261g  42g  14g R 11.8 33.3   88:33.81 -          MutationStage-6
 11725 cassandr  20   0  261g  42g  14g R 11.8 33.3   78:12.70 -          MutationStage-1
 11752 cassandr  20   0  261g  42g  14g S 11.8 33.3   83:25.14 futex_wai  MutationStage-9
 11755 cassandr  20   0  261g  42g  14g R 11.8 33.3   82:38.87 -          MutationStage-9
 11776 cassandr  20   0  261g  42g  14g S 11.8 33.3   79:31.49 futex_wai  MutationStage-1
 11781 cassandr  20   0  261g  42g  14g R 11.8 33.3   75:01.54 -          MutationStage-1
 11796 cassandr  20   0  261g  42g  14g S 11.8 33.3   77:03.78 -          MutationStage-1
 11804 cassandr  20   0  261g  42g  14g R 11.8 33.3   81:38.46 -          MutationStage-1
 11818 cassandr  20   0  261g  42g  14g S 11.8 33.3   76:51.42 -          MutationStage-1
 11823 cassandr  20   0  261g  42g  14g R 11.8 33.3   75:56.69 -          MutationStage-1
 11506 cassandr  20   0  261g  42g  14g R 11.5 33.3  502:50.67 -          MutationStage-1
 11513 cassandr  20   0  261g  42g  14g R 11.5 33.3  140:00.60 -          MutationStage-1
 11515 cassandr  20   0  261g  42g  14g S 11.5 33.3  123:31.16 futex_wai  MutationStage-1
 11676 cassandr  20   0  261g  42g  14g S 11.5 33.3   93:44.36 futex_wai  MutationStage-4
 11680 cassandr  20   0  261g  42g  14g S 11.5 33.3   93:28.55 futex_wai  MutationStage-4
 11706 cassandr  20   0  261g  42g  14g R 11.5 33.3   89:17.10 -          MutationStage-6
 11729 cassandr  20   0  261g  42g  14g R 11.5 33.3   78:42.33 -          MutationStage-1
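
For reference, this per-thread view presumably comes from top's thread mode
(top -H), so the MutationStage-* rows above are individual threads of the
java process. A rough sketch of how a hot thread here can be matched against
a thread dump entry, since dumps report thread IDs in hex as nid=0x...:

top -H -p <cassandra_pid>    # per-thread CPU for the Cassandra JVM
printf '%x\n' 11798          # -> 2e16; look for nid=0x2e16 in the dump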


On Thu, 6 Feb 2020 at 10:17, Elliott Sims  wrote:

> Async-profiler (https://github.com/jvm-profiling-tools/async-profiler )
> flamegraphs can also be a really good tool to figure out the exact
> callgraph that's leading to the futex_wait, both in and out of the JVM.
>


Re: Nodes becoming unresponsive

2020-02-06 Thread Elliott Sims
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler )
flamegraphs can also be a really good tool to figure out the exact
callgraph that's leading to the futex_wait, both in and out of the JVM.
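
A rough sketch of how async-profiler is typically run against a live process
(options taken from the async-profiler README, so double-check them against
the version you download; -e cpu also needs perf_events to be accessible on
the box, and <cassandra_pid> is a placeholder):

./profiler.sh -d 30 -e cpu -f /tmp/cassandra-cpu.svg <cassandra_pid>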


Re: Nodes becoming unresponsive

2020-02-05 Thread Surbhi Gupta
Sure Erick...

I tried strace as well ...


Re: Nodes becoming unresponsive

2020-02-05 Thread Erick Ramirez
Surbhi, just a *friendly* reminder that it's customary to reply to the
mailing list instead of emailing me directly, so that everyone else on the
list can participate. ☺


> I tried taking a thread dump using kill -3  but it just came back and
> no file was generated.
> How do you take the thread dump?


The Swiss Java Knife (SJK) which Jeff referenced previously is a nice
utility for it:

> ... then yea, thread inspection is your next step.
> https://github.com/aragozin/jvm-tools/blob/master/sjk-core/docs/TTOP.md ,
> for example.
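
A rough sketch of invoking it (the exact options are documented on the TTOP
page quoted above, so verify them against your SJK build; <cassandra_pid> is
a placeholder):

java -jar sjk.jar ttop -p <cassandra_pid> -n 20 -o CPU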


Re: Nodes becoming unresponsive

2020-02-05 Thread Jeff Jirsa
The bug is in the kernel - it'd be worth looking at your specific kernel
via `uname -a` just to confirm you're not somehow running an old kernel. If
you're sure you're on a good kernel, then yea, thread inspection is your
next step.
https://github.com/aragozin/jvm-tools/blob/master/sjk-core/docs/TTOP.md ,
for example.
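
Concretely, something along these lines; the rpm changelog check is the same
one that appears in the original post further down the thread:

uname -r
sudo rpm -q --changelog kernel-$(uname -r) | grep -i futex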



On Wed, Feb 5, 2020 at 8:37 PM Erick Ramirez  wrote:

> I wrote that article 5 years ago but I didn't think it would still be
> relevant today. 
>
> Have you tried to do a thread dump to see which are the most dominant
> threads? That's the most effective way of troubleshooting high CPU
> situations. Cheers!
>
>>


Re: Nodes becoming unresponsive

2020-02-05 Thread Erick Ramirez
I wrote that article 5 years ago but I didn't think it would still be
relevant today. 

Have you tried to do a thread dump to see which are the most dominant
threads? That's the most effective way of troubleshooting high CPU
situations. Cheers!
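
A quick sketch of the usual ways to grab one. Note that kill -3 does not
write a file; it prints the dump to the JVM's stdout, so it typically ends up
in Cassandra's console/stdout log:

jstack -l <cassandra_pid> > /tmp/threads.txt   # needs a JDK on the host
kill -3 <cassandra_pid>                        # JRE-friendly; dump goes to the process's stdout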

>


Nodes becoming unresponsive

2020-02-05 Thread Surbhi Gupta
Hi,

We have noticed that in a Cassandra cluster one of the nodes has 100% CPU
utilization; using top we can see that the cassandra process is showing
futex_wait.

We are on CentOS release 6.10 (Final). As per the document below, the futex
bug was on CentOS 6.6:
https://support.datastax.com/hc/en-us/articles/206259833-Nodes-appear-unresponsive-due-to-a-Linux-futex-wait-kernel-bug

Below are the installed patches.

sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref

- [kernel] futex: Mention key referencing differences between shared and
private futexes (Larry Woodman) [1167405]

- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier
(Larry Woodman) [1167405]

- [kernel] futex: Fix errors in nested key ref-counting (Denys Vlasenko)
[1094458] {CVE-2014-0205}

- [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347]
{CVE-2010-0623}

top - 21:23:34 up 93 days, 10:43,  1 user,  load average: 137.35, 147.74, 148.52
Tasks: 658 total,   1 running, 657 sleeping,   0 stopped,   0 zombie
Cpu(s): 93.9%us,  1.9%sy,  2.0%ni,  2.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  132236016k total, 129681568k used,  2554448k free,   215888k buffers
Swap:        0k total,        0k used,        0k free, 93679880k cached

   PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM      TIME+  WCHAN      COMMAND
  7725 cassandr  20   0  258g  40g  13g S 2302.0 32.4  305169:26 futex_wai  java
 69075 logstash  39  19 10.5g 1.5g  14m S   42.1  1.2    6763:00 futex_wai  java
 30008 root      20   0  465m  55m  11m S   11.5  0.0    0:02.78 poll_sche  TaniumClient
 31785 cassandr  20   0 34.9g  31m  10m S    4.9  0.0    0:00.15 futex_wai  java
  5154 root      20   0 1523m 6260 1300 S    3.0  0.0    1073:05 hrtimer_n  collectd
  1129 root      20   0     0    0    0 S    1.3  0.0  294:55.87 kjournald  jbd2/dm-0-8
 64173 root      20   0 1512m  71m  13m S    1.3  0.1    0:55.69 futex_wai  TaniumClient
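
For what it's worth, the wait channel can also be listed per thread with ps,
which makes it easy to count how many threads are sitting in futex_wait (a
rough sketch; format specifiers may vary slightly by procps version):

ps -eLo pid,tid,pcpu,wchan:32,comm --sort=-pcpu | head -40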

Any idea what else can be looked at for this high CPU issue?

Thanks
Surbhi


Nodes becoming unresponsive intermediately (Gossip stage pending)

2017-01-18 Thread Sermandurai Konar
Hi,

We have an 11/11 node cluster (across two data centers) running Cassandra
version 2.1.15.
We are observing that 3 nodes from each data center become unresponsive for
short periods of time.
This behavior is happening only on these 6 nodes (i.e. 3 from each data
center), and we are seeing a lot of Gossip stage pending tasks and periodic
commit log syncer issues. This results in a huge number of dropped mutations,
and the affected nodes also have high CPU usage compared to the other nodes.

Log entry from system.log:
WARN  [GossipTasks:1] 2017-01-18 10:15:51,328  Gossiper.java:748 - Gossip
stage has 2 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:52,428  Gossiper.java:748 - Gossip
stage has 3 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:53,529  Gossiper.java:748 - Gossip
stage has 3 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:54,629  Gossiper.java:748 - Gossip
stage has 5 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:55,730  Gossiper.java:748 - Gossip
stage has 7 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:56,830  Gossiper.java:748 - Gossip
stage has 9 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:57,930  Gossiper.java:748 - Gossip
stage has 12 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:15:59,031  Gossiper.java:748 - Gossip
stage has 15 pending tasks; skipping status check (no nodes will be marked
down)
WARN  [GossipTasks:1] 2017-01-18 10:16:00,131  Gossiper.java:748 - Gossip
stage has 16 pending tasks; skipping status check (no nodes will be marked
down)



WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2017-01-18
06:35:44,186  AbstractCommitLogService.java:105 - Out of 33 commit log
syncs over the past 309s with average duration of 9506.03ms, 13 have
exceeded the configured commit interval by an average of 8676.77ms
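
For reference, the pending-task and dropped-mutation counts behind these
warnings can be watched with standard nodetool commands, e.g.:

nodetool tpstats     # per-stage active/pending/blocked counts and dropped message totals
nodetool netstats    # streaming and messaging activity on the node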


*Observation:*
The 6 nodes having issues are running Red Hat version 6.8, while all other
nodes in the cluster are on version 6.7. The nodes on 6.7 are not showing any
gossip issues, mutation drops, CPU issues, or IO problems.
Also, the IO stats on the affected nodes are bad compared to the other nodes.
We also see that threads are blocked in HintedHandoff.

*-bash-4.1$ sar 2 7*

10:29:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:29:53 AM     all     36.83      4.86      5.51      4.65      0.00     48.15
10:29:55 AM     all     41.09      4.98      6.68      2.28      0.00     44.97
10:29:57 AM     all     37.53      4.71      5.52      2.04      0.00     50.21
10:29:59 AM     all     35.82      4.76      4.97      2.14      0.00     52.31
10:30:01 AM     all     33.90      3.53      3.82      2.62      0.00     56.13
10:30:03 AM     all     31.78      2.64      3.53      4.65      0.00     57.39
10:30:05 AM     all     33.27      3.90      3.95      2.76      0.00     56.12
Average:        all     35.74      4.20      4.85      3.02      0.00     52.19
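
For the IO comparison mentioned above, iostat in extended mode gives a
per-device view over the same kind of interval (a rough sketch):

iostat -x 2 7    # extended per-device stats (await, svctm, %util)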

*Threads are also blocked in HintedHandoff:*

"HintedHandoff:1" daemon prio=1 os_prio=4 tid=0x440 nid=0x44  [ JVM locked
by VM (w/poll advisory bit) acquiring VM lock
'org.apache.cassandra.db.HintedHandOffManager', polling bits: safep rstak
gpgc_clean_new ]

  java.lang.Thread.State: BLOCKED (on object monitor)

at org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:269)

at org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:563)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)
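
For reference, whether a hints compaction is actually in progress at that
point can be checked from standard nodetool output, e.g.:

nodetool compactionstats    # in-progress compactions; on 2.1 a hints compaction shows up against system.hints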

We are working with the system team to see if there is any IO issue and
whether this issue is associated with futex_wait.

Any other pointers to fix this issue would be great.

Thanks,
Sermandurai.