Hello Cassandra community,
I am new to the list, so I have no experience posting questions. We have Cassandra 
deployed in our product, and in production I see the following: the Cassandra nodes 
belonging to one specific datacenter run frequent garbage collections that impact 
system performance. A lot of queries to those nodes time out during garbage 
collection; I know that this is expected. However, I cannot figure out why the 
garbage collections are so frequent. One of the affected production nodes is now 
isolated and does not serve any load, yet the frequent GCs are still visible. The 
other strange thing is that the cluster consists of 3 datacenters, and the other 
2 datacenters do not show the same behavior even though they serve traffic as well.

The deployment of the Cassandra nodes is uniform across all datacenters. What is 
more, the OS and hardware are also the same.

The log entries that indicate the GCs are the following (post-processed with awk 
for more human-readable output; a sketch of the awk step is included after the 
excerpt):

2022-06-28 08:21:29,538 481ms. Old Gen: 5.30356GB -> 417.736MB Survivor Space: 63.6879MB -> 0
2022-06-28 09:28:04,229 627ms. Old Gen: 5.27757GB -> 548.777MB Survivor Space: 55.0002MB -> 0
2022-06-28 10:29:55,439 419ms. Old Gen: 5.29753GB -> 363.493MB Survivor Space: 79.1411MB -> 0
2022-06-28 11:28:35,382 530ms. Old Gen: 5.28328GB -> 473.46MB Survivor Space: 60.6253MB -> 0
2022-06-28 12:22:57,251 658ms. Old Gen: 5.26409GB -> 570.007MB Survivor Space: 63.3597MB -> 0
2022-06-28 13:20:46,893 491ms. Old Gen: 5.26584GB -> 414.906MB Survivor Space: 64.5628MB -> 0
2022-06-28 14:21:17,507 511ms. Old Gen: 5.26839GB -> 449.541MB Survivor Space: 66.2503MB -> 0
2022-06-28 15:21:54,272 562ms. Old Gen: 5.30167GB -> 493.219MB Survivor Space: 69.8441MB -> 0
2022-06-28 16:17:44,517 632ms. Old Gen: 5.23687GB -> 569.884MB Survivor Space: 65.2972MB -> 0
2022-06-28 17:08:39,581 593ms. Old Gen: 5.31225GB -> 503.452MB Survivor Space: 9.15627MB -> 0
2022-06-28 18:00:14,865 474ms. Old Gen: 5.25287GB -> 405.377MB Survivor Space: 66.719MB -> 0
2022-06-28 18:52:39,165 643ms. Old Gen: 5.28023GB -> 559.397MB Survivor Space: 61.1409MB -> 0
2022-06-28 19:52:31,017 598ms. Old Gen: 5.30456GB -> 501.958MB Survivor Space: 60.0472MB -> 0
2022-06-28 21:06:59,596 505ms. Old Gen: 5.29862GB -> 445.448MB Survivor Space: 51.5159MB -> 0
2022-06-28 22:43:05,232 478ms. Old Gen: 5.29055GB -> 425.308MB Survivor Space: 39.1877MB -> 0
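
For completeness, the awk post-processing is roughly the following sketch 
(simplified; it assumes the GCInspector lines in system.log report the pool sizes 
as raw byte counts, and the actual script and log path on our nodes differ 
slightly):

  grep 'GCInspector' system.log | awk '
    # convert any large integer field (a raw byte count) to a human-readable unit
    function human(b) {
      if (b >= 1073741824) return sprintf("%.6g", b / 1073741824) "GB"
      if (b >= 1048576)    return sprintf("%.6g", b / 1048576) "MB"
      return b
    }
    {
      line = ""
      for (i = 1; i <= NF; i++) {
        f = $i; sub(/[;,]$/, "", f)                # drop trailing punctuation
        if (f ~ /^[0-9]+$/ && f + 0 >= 1048576) f = human(f + 0)
        line = line (i > 1 ? " " : "") f
      }
      print line
    }'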

One more strange observation is that the maximum heap, as printed by the nodetool 
info command, keeps changing on the problematic Cassandra nodes even though the 
Cassandra process does not restart. Here is the nodetool info output sampled every 
5 minutes on one of the problematic Cassandra nodes (the sampling loop is sketched 
after the output):
Thu Jun 23 14:27:16 CEST 2022
Heap Memory (MB)       : 3581.64 / 8078.50
Thu Jun 23 14:32:18 CEST 2022
Heap Memory (MB)       : 4308.25 / 8089.50
Thu Jun 23 14:37:21 CEST 2022
Heap Memory (MB)       : 5024.90 / 8095.00
Thu Jun 23 14:42:24 CEST 2022
Heap Memory (MB)       : 6270.35 / 8085.00
Thu Jun 23 14:47:26 CEST 2022
Heap Memory (MB)       : 6843.82 / 8091.00
Thu Jun 23 14:52:29 CEST 2022
Heap Memory (MB)       : 881.51 / 8086.00
Thu Jun 23 14:57:32 CEST 2022
Heap Memory (MB)       : 1699.78 / 8093.50
Thu Jun 23 15:02:34 CEST 2022
Heap Memory (MB)       : 2868.04 / 8089.00
Thu Jun 23 15:07:37 CEST 2022
Heap Memory (MB)       : 4719.43 / 8067.00
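
The sampling above was collected with essentially the following loop (simplified):

  while true; do
    date
    nodetool info | grep 'Heap Memory'
    sleep 300    # sample every 5 minutes
  done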

This observation does not hold for the normally operating Cassandra nodes. 
Could this be related?

Could you please help me investigate this issue? I feel that there is something 
special about the systems hosting Cassandra in the problematic datacenter and that 
it is not just a matter of tuning the heap memory configuration.

BR
Michail Kotsiouros
