Hi All,

I'm hoping someone can point me in the right direction with this or offer some ideas.
I've done a lot of research but haven't found any other threads describing similar
behavior. The closest one I could find was this
<https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>,
which apparently was never resolved.

I've been working on this issue for a few weeks. We have a JVM-based server
application that is experiencing long pauses, anywhere from 10 seconds to several
minutes. Below is an example of the safepoint statistics output:

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
298.236: no vm operation                  [     385          0              0    ]      [     0     0 21638     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
320.877: no vm operation                  [     336          0              0    ]      [     0     0  1978     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
336.859: no vm operation                  [     309          1              2    ]      [     0     0  6497     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
351.362: no vm operation                  [     294          0              0    ]      [     0     0 32821     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
401.189: no vm operation                  [     296          3              2    ]      [     0     0  3471     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
427.675: no vm operation                  [     303          0              0    ]      [     0     0 10414     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
446.098: no vm operation                  [     346          2              1    ]      [     0     0 48768     1     0    ]  0


These entries are printed back-to-back while the application runs. The GC log
covering the same window doesn't even indicate that these pauses were occurring:

2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds

During the period from roughly t=270s to t=494s, the application essentially
stopped serving all IO requests and did not respond to any jstack attempts.

Environment:
OS: RHEL 7.8 (kernel 3.10.0-1127)
JVM: Oracle JDK 8u161
Hardware: 72 cores, 754 GiB RAM

About 650 GiB of that memory is allocated to a ramdisk; the JVM (flags below) has
a 32 GiB max heap and 24 GiB of direct memory. This is the only application
running on the server aside from a metrics-collection daemon.

The JVM has the following flags specified:

-Xmx32g
-Xms32g
-XX:MaxDirectMemorySize=24g
-XX:+PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=10000
-XX:-UseBiasedLocking
-Xloggc:<path>/logs/gc_worker.log
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=7093
-Dcom.sun.management.jmxremote.rmi.port=7093
-Djava.net.preferIPv4Stack=true

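In case it helps anyone reproduce my reading of the stopped-time log, something
like the following throwaway filter pulls out the long stops and how much of each
was spent just bringing threads to the safepoint. The class name, threshold, and
argument handling are purely illustrative; the regex assumes only the log format
shown above.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LongStops {
        // Matches the -XX:+PrintGCApplicationStoppedTime lines shown above.
        private static final Pattern STOP = Pattern.compile(
            "stopped: ([0-9.]+) seconds, Stopping threads took: ([0-9.]+)");

        public static void main(String[] args) throws IOException {
            Files.lines(Paths.get(args[0])).forEach(line -> {
                Matcher m = STOP.matcher(line);
                // Flag stops over 1s; group(2) is the time-to-safepoint part.
                if (m.find() && Double.parseDouble(m.group(1)) > 1.0) {
                    System.out.println("TTSP " + m.group(2) + "s of "
                                       + m.group(1) + "s: " + line);
                }
            });
        }
    }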

I've already read through most of the discussion at these links about somewhat
similar behavior:
- https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
- https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
- https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
- https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
- http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
- http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e

The application itself does a lot of memory-mapped file IO against spinning
disks. Each mapping is usually just a few MiB, spread across 12 enterprise-class
drives or the ramdisk. Previously we've been able to saturate 10 Gbit links with
the same application on much smaller machines (8-16 cores, 64 GiB RAM) without
any issue. On this hardware there are 144 threads doing the mapped IO at any
given time; the sketch below shows roughly what each of them does.
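Purely illustrative (the path and sizes are made up, and our real code is more
involved), but this is the pattern, and it matters because a page fault inside a
MappedByteBuffer access happens while the thread is still "in Java":

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(
                    Paths.get("/data/disk01/segment.dat"),  // hypothetical path
                    StandardOpenOption.READ)) {
                // Map a few MiB at a time, as described above.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                long sum = 0;
                for (int i = 0; i < buf.limit(); i++) {
                    // Each get() can page-fault against the spinning disk. While
                    // the fault is serviced, the thread cannot reach a safepoint,
                    // so a slow fault shows up as time-to-safepoint for the VM.
                    sum += buf.get(i);
                }
                System.out.println(sum);
            }
        }
    }

If one of those faults gets stuck behind a saturated disk, every other thread
ends up waiting at the safepoint for it, which is one of the theories I've been
trying to confirm or rule out.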

One of the points that stands out the most is that -XX:+SafepointTimeout never
actually prints any information about threads failing to come to a safepoint.
My understanding is that whenever a safepoint statistics entry shows a "sync"
time above 10000 ms, the names of the threads that failed to reach the safepoint
should be printed to stdout. There is no such output anywhere in my logs.
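To be clear about what I mean by "sync" time: it's the time from the VM thread
requesting the safepoint until the last thread reaches it. A minimal, hypothetical
demo of how a single thread can inflate it (not our code, and probably not our
exact failure mode, since our spin/block columns are zero):

    public class SyncTimeDemo {
        static volatile long sink;

        public static void main(String[] args) throws Exception {
            Thread spinner = new Thread(() -> {
                long sum = 0;
                // An int-counted loop: on 8u, C2 emits no safepoint poll inside
                // it, so a safepoint requested while it runs must wait for it.
                for (int i = 0; i < Integer.MAX_VALUE; i++) {
                    sum += i;
                }
                sink = sum;
            });
            spinner.start();
            Thread.sleep(200);  // let the loop get hot and JIT-compiled
            long t0 = System.nanoTime();
            System.gc();        // needs a safepoint; stalls until the loop exits
            System.out.printf("safepoint reached after %d ms%n",
                              (System.nanoTime() - t0) / 1_000_000);
            spinner.join();
        }
    }

Run with the same -XX:+PrintSafepointStatistics flags we use, the wait shows up
in the spin and sync columns, whereas our logs show sync time with no spin at
all, which is part of what makes this so confusing.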

One thing I have not yet tried is using perf and kernel uprobes to capture the
stacks (as suggested here
<https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>).
The reason is that this software runs on a cluster of ~90 nodes (all with the
same hardware and software stack), and the pauses occur seemingly at random
across all of them, so setting up the uprobes and collecting the data would be
a complex task I haven't had time for yet.
 
Any ideas for debugging this are greatly appreciated.

Thanks,
Zac
