RE: Strange Garbage Collections results in service impact

2022-07-06 Thread Michail Kotsiouros via user
Hello Cassandra community,
I have restarted one of the Cassandra nodes on which the issue was observed, and the 
GC is no longer invoked so frequently, so the service impact has stopped. The 
rest of the Cassandra nodes in the same datacenter, which were not restarted, 
still have the issue.
What is more, the max heap memory reported by nodetool info has stabilized 
on the restarted Cassandra node. It now reports the same value as the Cassandra 
nodes in the other, normally operating datacenters. Again, the Cassandra nodes in the 
problematic datacenter that have not been restarted still report differing max 
heap memory values in the nodetool info printout, as monitored every 5 minutes for an 
hour.
From this observation, I conclude that restarting the Cassandra nodes is the way 
forward to resolve the frequent GC issue. Moreover, the changing max heap 
memory reported by nodetool info on the affected Cassandra nodes seems to be 
related to the root cause of the problem.
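
One way to cross-check where the changing max value comes from would be to ask 
the running JVM directly and compare it with the configured -Xmx. This is only a 
hypothetical sketch, not something already done in this thread; it assumes the 
standard JDK tools are installed on the node and that the Cassandra process can 
be found via pgrep:

#!/usr/bin/env bash
# Hypothetical cross-check: compare the configured max heap with what the
# running JVM currently reports. Assumes jinfo and jstat are on the PATH.
CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)

# Configured max heap size, in bytes (should correspond to -Xmx)
jinfo -flag MaxHeapSize "$CASSANDRA_PID"

# Current and maximum generation capacities (KB); 5 samples, 5 seconds apart,
# to see whether the capacities themselves move between samples.
jstat -gccapacity "$CASSANDRA_PID" 5s 5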
Does this ring a bell? Has anyone observed such behavior before? What could 
have caused it? Could it be something in the JVM?

BR
MK
From: Michail Kotsiouros via user 
Sent: Friday, July 1, 2022 11:01
To: user@cassandra.apache.org
Subject: Strange Garbage Collections results in service impact


Strange Garbage Collections results in service impact

2022-07-01 Thread Michail Kotsiouros via user
Hello Cassandra community,
I am new to the list, so I have no experience in posting questions. We have Cassandra 
deployed in our product, and in one of our production systems I see the following. 
The Cassandra nodes belonging to a specific datacenter have frequent Garbage 
Collections that impact system performance. A lot of queries to those Cassandra 
nodes time out during the Garbage Collection pauses. I know that this is expected. 
However, I cannot figure out why we have such frequent Garbage Collections. 
The production node is now isolated and does not serve any load, yet the 
frequent GCs are still visible. The other strange thing is that the cluster 
consists of 3 datacenters, and the other 2 datacenters do not show the 
same behavior even though they serve traffic as well.

The deployment of the Cassandra nodes is uniform across all datacenters. What is 
more, the OS and hardware are the same as well.

The log entries that indicate the GCs are the following (post-processed with awk 
to give a more human-readable output):

2022-06-28 08:21:29,538 481ms. Old Gen: 5.30356GB -> 417.736MB Survivor Space: 
63.6879MB -> 0
2022-06-28 09:28:04,229 627ms. Old Gen: 5.27757GB -> 548.777MB Survivor Space: 
55.0002MB -> 0
2022-06-28 10:29:55,439 419ms. Old Gen: 5.29753GB -> 363.493MB Survivor Space: 
79.1411MB -> 0
2022-06-28 11:28:35,382 530ms. Old Gen: 5.28328GB -> 473.46MB Survivor Space: 
60.6253MB -> 0
2022-06-28 12:22:57,251 658ms. Old Gen: 5.26409GB -> 570.007MB Survivor Space: 
63.3597MB -> 0
2022-06-28 13:20:46,893 491ms. Old Gen: 5.26584GB -> 414.906MB Survivor Space: 
64.5628MB -> 0
2022-06-28 14:21:17,507 511ms. Old Gen: 5.26839GB -> 449.541MB Survivor Space: 
66.2503MB -> 0
2022-06-28 15:21:54,272 562ms. Old Gen: 5.30167GB -> 493.219MB Survivor Space: 
69.8441MB -> 0
2022-06-28 16:17:44,517 632ms. Old Gen: 5.23687GB -> 569.884MB Survivor Space: 
65.2972MB -> 0
2022-06-28 17:08:39,581 593ms. Old Gen: 5.31225GB -> 503.452MB Survivor Space: 
9.15627MB -> 0
2022-06-28 18:00:14,865 474ms. Old Gen: 5.25287GB -> 405.377MB Survivor Space: 
66.719MB -> 0
2022-06-28 18:52:39,165 643ms. Old Gen: 5.28023GB -> 559.397MB Survivor Space: 
61.1409MB -> 0
2022-06-28 19:52:31,017 598ms. Old Gen: 5.30456GB -> 501.958MB Survivor Space: 
60.0472MB -> 0
2022-06-28 21:06:59,596 505ms. Old Gen: 5.29862GB -> 445.448MB Survivor Space: 
51.5159MB -> 0
2022-06-28 22:43:05,232 478ms. Old Gen: 5.29055GB -> 425.308MB Survivor Space: 
39.1877MB -> 0
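
For reference, a minimal sketch of the kind of awk post-processing mentioned 
above could look like the following. The GCInspector line layout assumed in the 
comments is a guess and differs between Cassandra versions and GC algorithms, 
so the token matching would need adjusting:

#!/usr/bin/env bash
# Minimal sketch of the awk post-processing mentioned above. It assumes
# GCInspector lines in system.log roughly of the form:
#   INFO  [Service Thread] 2022-06-28 08:21:29,538 GCInspector.java:284 -
#     ConcurrentMarkSweep GC in 481ms.  CMS Old Gen: 5694498816 -> 438022144; ...
# (the exact layout depends on the Cassandra version and the GC in use)
grep 'GCInspector' /var/log/cassandra/system.log | awk '
{
  ts = ""; dur = ""; before = 0; after = 0
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^20[0-9][0-9]-/)  ts = $i " " $(i + 1)       # date + time,millis
    if ($i ~ /ms\.$/)           dur = $i                   # pause duration
    if ($i == "Gen:")           { before = $(i + 1); after = $(i + 3) }  # bytes
  }
  if (dur != "")
    printf "%s %s Old Gen: %.6gGB -> %.3fMB\n",
           ts, dur, before / 1024 ^ 3, after / 1024 ^ 2
}'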

One more strange observation is that the maximum heap, as printed by the nodetool 
info command, keeps changing on the problematic Cassandra nodes even though the 
Cassandra process does not restart. Here is the output of nodetool info on one of 
the problematic Cassandra nodes, sampled every 5 minutes:
Thu Jun 23 14:27:16 CEST 2022
Heap Memory (MB)   : 3581.64 / 8078.50
Thu Jun 23 14:32:18 CEST 2022
Heap Memory (MB)   : 4308.25 / 8089.50
Thu Jun 23 14:37:21 CEST 2022
Heap Memory (MB)   : 5024.90 / 8095.00
Thu Jun 23 14:42:24 CEST 2022
Heap Memory (MB)   : 6270.35 / 8085.00
Thu Jun 23 14:47:26 CEST 2022
Heap Memory (MB)   : 6843.82 / 8091.00
Thu Jun 23 14:52:29 CEST 2022
Heap Memory (MB)   : 881.51 / 8086.00
Thu Jun 23 14:57:32 CEST 2022
Heap Memory (MB)   : 1699.78 / 8093.50
Thu Jun 23 15:02:34 CEST 2022
Heap Memory (MB)   : 2868.04 / 8089.00
Thu Jun 23 15:07:37 CEST 2022
Heap Memory (MB)   : 4719.43 / 8067.00
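
For completeness, the samples above were collected with a simple loop along 
these lines (a sketch, not the exact script used; the grep pattern just assumes 
the "Heap Memory (MB)" line that nodetool info prints):

#!/usr/bin/env bash
# Sample the heap figures every 5 minutes, as shown above.
while true; do
    date
    nodetool info | grep 'Heap Memory (MB)'
    sleep 300
done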

This observation does not hold for the normally operating Cassandra nodes. 
Could this be related?

Can you please help me with the investigation of this issue? I feel that 
there is something particular about the systems hosting Cassandra in the problematic 
datacenter, and that it is not just a matter of tuning the heap memory configuration.

BR
Michail Kotsiouros