[jira] [Comment Edited] (CASSANDRA-20980) GCInspector should use different thresholds on GC events for warning messages

Dmitry Konstantinov (Jira) Sat, 15 Nov 2025 09:32:16 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038577#comment-18038577
 ]


Dmitry Konstantinov edited comment on CASSANDRA-20980 at 11/15/25 5:30 PM:
---------------------------------------------------------------------------

I've done few e2e tests for the changes by running Cassandra node built from 
[https://github.com/apache/cassandra/pull/4454]

There are few non-committed debug logging changes to simplify analysis: logging 
of listen MBeans + logging of all GC events):

[^debug_logging_gc_inspector.patch]

The behaviour looks as expected for me.
Used JDK: 
{code:java}
openjdk version "17.0.15" 2025-04-15
OpenJDK Runtime Environment Temurin-17.0.15+6 (build 17.0.15+6) 
{code}
I've run a few load tests to cause GC.
Default GC inspector limits were used:
gc_log_threshold = 200 ms
gc_warn_threshold = 1s
gc_concurrent_phase_log_threshold = 1s
gc_concurrent_phase_warn_threshold = 2s
h3. G1

[^system_g1.log] [^gc_g1.log] 
MBeans
{code:java}
java.lang:name=G1 Young Generation,type=GarbageCollector
java.lang:name=G1 Old Generation,type=GarbageCollector
{code}
Pauses detection still works:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:19:13,080 GCInspector.java:312 - G1 
Young Generation GC in 233ms.  G1 Eden Space: 6476005376 -> 0; G1 Old Gen: 
22573666304 -> 23790014464; G1 Survivor Space: 1258291200 -> 1073741824; 
{code}
Note: we do not have an official support of JDK 21 as of now, I so have not 
tested G1 concurrent mode, A new JMX bean for it has been added in JDK20 
([https://bugs.openjdk.org/browse/JDK-8297754]).
h3. Shenandoah

[^system_shenandoah.log] [^gc_shenandoah.log] 
MBeans
{code:java}
java.lang:name=Shenandoah Pauses,type=GarbageCollector
java.lang:name=Shenandoah Cycles,type=GarbageCollector
{code}
We log Shenandoah Cycles concurrent GCs with time > 1 sec as INFO:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:31:21,372 GCInspector.java:300 - 
Shenandoah Cycles GC in 1762ms.  CodeHeap 'non-nmethods': 2047744 -> 3201536; 
CodeHeap 'non-profiled nmethods': 18649088 -> 18653952; CodeHeap 'profiled 
nmethods': 19999488 -> 20063744; Metaspace: 63209472 -> 63218224; Shenandoah: 
17291619784 -> 6818925104
{code}
We log Shenandoah Cycles concurrent GCs with time > 2 sec as WARN:
{code:java}
WARN  [Notification Thread] 2025-11-15T16:30:53,812 GCInspector.java:298 - 
Shenandoah Cycles GC in 3813ms.  CodeHeap 'non-nmethods': 3778432 -> 2047744; 
CodeHeap 'non-profiled nmethods': 18428800 -> 18444032; CodeHeap 'profiled 
nmethods': 19460992 -> 19521024; Metaspace: 63072296 -> 63089704; Shenandoah: 
22980826480 -> 22186858112
{code}
We do not log Shenandoah Cycles concurrent GCs with time < 1 sec and > 300 ms 
(this is the main goal of this improvement):
{code:java}
// it is present only in debug output but not reported by GCInspector real 
logging
INFO  [Notification Thread] 2025-11-15T16:31:50,315 GCInspector.java:265 - 
debug, gcId: 51, gcName: Shenandoah Cycles, gcCause: Concurrent GC, gcAction: 
end of GC cycle, event duration: 349, calculated duration: 348
{code}
h3. ZGC

[^system_zgc.log] [^gc_zgc.log] 
MBeans
{code:java}
java.lang:name=ZGC Cycles,type=GarbageCollector
java.lang:name=ZGC Pauses,type=GarbageCollector
{code}
We log ZGC Cycles concurrent GCs with time > 1 sec as INFO:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:38:53,877 GCInspector.java:300 - ZGC 
Cycles GC in 1694ms.  CodeHeap 'non-nmethods': 3414912 -> 1684224; CodeHeap 
'non-profiled nmethods': 13763712 -> 14028160; CodeHeap 'profiled nmethods': 
19596416 -> 18350848; Compressed Class Space: 7549152 -> 7551488; Metaspace: 
62210520 -> 62218096; ZHeap: 6757023744 -> 9124708352
{code}
We log ZGC Cycles concurrent GCs with time > 2 sec as WARN:
{code:java}
WARN  [Notification Thread] 2025-11-15T16:38:56,746 GCInspector.java:298 - ZGC 
Cycles GC in 2660ms.  CodeHeap 'non-nmethods': 3414912 -> 1684224; CodeHeap 
'non-profiled nmethods': 14354048 -> 14567424; CodeHeap 'profiled nmethods': 
18724352 -> 18601856; Compressed Class Space: 7556032 -> 7556544; Metaspace: 
62223720 -> 62262456; ZHeap: 10284433408 -> 17983078400
{code}
We do not log ZGC Cycles concurrent GCs with time < 1 sec and > 300 ms (this is 
the main goal of this improvement):
{code:java}
// it is present only in debug output but not reported by GCInspector real 
logging
INFO  [Notification Thread] 2025-11-15T16:42:49,461 GCInspector.java:265 - 
debug, gcId: 50, gcName: ZGC Cycles, gcCause: Allocation Rate, gcAction: end of 
GC cycle, event duration: 379, calculated duration: 380

{code}


was (Author: dnk):
I've done few e2e tests for the changes (by running Cassandra node built from 
[https://github.com/apache/cassandra/pull/4454] with few debug logging changes 
to simplify analysis: logging of listen MBeans + logging of all GC events).

The behaviour looks as expected for me.
Used JDK: 
{code:java}
openjdk version "17.0.15" 2025-04-15
OpenJDK Runtime Environment Temurin-17.0.15+6 (build 17.0.15+6) 
{code}
I've run a few load tests to cause GC.
Default GC inspector limits were used:
gc_log_threshold = 200 ms
gc_warn_threshold = 1s
gc_concurrent_phase_log_threshold = 1s
gc_concurrent_phase_warn_threshold = 2s
h3. G1

[^system_g1.log] [^gc_g1.log] 
MBeans
{code:java}
java.lang:name=G1 Young Generation,type=GarbageCollector
java.lang:name=G1 Old Generation,type=GarbageCollector
{code}
Pauses detection still works:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:19:13,080 GCInspector.java:312 - G1 
Young Generation GC in 233ms.  G1 Eden Space: 6476005376 -> 0; G1 Old Gen: 
22573666304 -> 23790014464; G1 Survivor Space: 1258291200 -> 1073741824; 
{code}
Note: we do not have an official support of JDK 21 as of now, I so have not 
tested G1 concurrent mode, A new JMX bean for it has been added in JDK20 
([https://bugs.openjdk.org/browse/JDK-8297754]).
h3. Shenandoah

[^system_shenandoah.log] [^gc_shenandoah.log] 
MBeans
{code:java}
java.lang:name=Shenandoah Pauses,type=GarbageCollector
java.lang:name=Shenandoah Cycles,type=GarbageCollector
{code}
We log Shenandoah Cycles concurrent GCs with time > 1 sec as INFO:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:31:21,372 GCInspector.java:300 - 
Shenandoah Cycles GC in 1762ms.  CodeHeap 'non-nmethods': 2047744 -> 3201536; 
CodeHeap 'non-profiled nmethods': 18649088 -> 18653952; CodeHeap 'profiled 
nmethods': 19999488 -> 20063744; Metaspace: 63209472 -> 63218224; Shenandoah: 
17291619784 -> 6818925104
{code}
We log Shenandoah Cycles concurrent GCs with time > 2 sec as WARN:
{code:java}
WARN  [Notification Thread] 2025-11-15T16:30:53,812 GCInspector.java:298 - 
Shenandoah Cycles GC in 3813ms.  CodeHeap 'non-nmethods': 3778432 -> 2047744; 
CodeHeap 'non-profiled nmethods': 18428800 -> 18444032; CodeHeap 'profiled 
nmethods': 19460992 -> 19521024; Metaspace: 63072296 -> 63089704; Shenandoah: 
22980826480 -> 22186858112
{code}
We do not log Shenandoah Cycles concurrent GCs with time < 1 sec and > 300 ms 
(this is the main goal of this improvement):
{code:java}
// it is present only in debug output but not reported by GCInspector real 
logging
INFO  [Notification Thread] 2025-11-15T16:31:50,315 GCInspector.java:265 - 
debug, gcId: 51, gcName: Shenandoah Cycles, gcCause: Concurrent GC, gcAction: 
end of GC cycle, event duration: 349, calculated duration: 348
{code}
h3. ZGC

[^system_zgc.log] [^gc_zgc.log] 
MBeans
{code:java}
java.lang:name=ZGC Cycles,type=GarbageCollector
java.lang:name=ZGC Pauses,type=GarbageCollector
{code}
We log ZGC Cycles concurrent GCs with time > 1 sec as INFO:
{code:java}
INFO  [Notification Thread] 2025-11-15T16:38:53,877 GCInspector.java:300 - ZGC 
Cycles GC in 1694ms.  CodeHeap 'non-nmethods': 3414912 -> 1684224; CodeHeap 
'non-profiled nmethods': 13763712 -> 14028160; CodeHeap 'profiled nmethods': 
19596416 -> 18350848; Compressed Class Space: 7549152 -> 7551488; Metaspace: 
62210520 -> 62218096; ZHeap: 6757023744 -> 9124708352
{code}
We log ZGC Cycles concurrent GCs with time > 2 sec as WARN:
{code:java}
WARN  [Notification Thread] 2025-11-15T16:38:56,746 GCInspector.java:298 - ZGC 
Cycles GC in 2660ms.  CodeHeap 'non-nmethods': 3414912 -> 1684224; CodeHeap 
'non-profiled nmethods': 14354048 -> 14567424; CodeHeap 'profiled nmethods': 
18724352 -> 18601856; Compressed Class Space: 7556032 -> 7556544; Metaspace: 
62223720 -> 62262456; ZHeap: 10284433408 -> 17983078400
{code}
We do not log ZGC Cycles concurrent GCs with time < 1 sec and > 300 ms (this is 
the main goal of this improvement):
{code:java}
// it is present only in debug output but not reported by GCInspector real 
logging
INFO  [Notification Thread] 2025-11-15T16:42:49,461 GCInspector.java:265 - 
debug, gcId: 50, gcName: ZGC Cycles, gcCause: Allocation Rate, gcAction: end of 
GC cycle, event duration: 379, calculated duration: 380

{code}

> GCInspector should use different thresholds on GC events for warning messages
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20980
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20980
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Observability/Logging
>            Reporter: Yuqi Yan
>            Assignee: Yuqi Yan
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: debug_logging_gc_inspector.patch, gc_g1.log, 
> gc_shenandoah.log, gc_zgc.log, system_g1.log, system_shenandoah.log, 
> system_zgc.log
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Running Cassandra 4.1 and recently testing with ShenandoahGC.
> Noticed that the log message is flooded with the warning message coming from 
> GCInspector.
> Taking a closer look, that seems that we're warning on every GC event that 
> takes longer than 1 second (default). And from document that the intention 
> was to monitor GC pause time.
> But for ShenandoahGC and ZGC, there are concurrent phases (no STW) and phases 
> that actually STW. 
> For those concurrent phases, it's expected to take longer time (if we're on 
> the non-generational version), and I feel that these phases:
> 1. require STW
> 2. doesn't require STW
> should use different thresholds in warning message.
> At least to me, phases require STW actually pause the thread, and should use 
> lower threshold compared to those concurrent phases.
> Will share a patch soon



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (CASSANDRA-20980) GCInspector should use different thresholds on GC events for warning messages

Reply via email to