Hi All, I'm hoping someone might be able to point me in the right direction with this or offer some ideas. I've done a lot of research but haven't found any other threads describing similar behavior. The closest one I could find was this <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>, which never seems to have been resolved.
I've been working on this issue for a few weeks. We have a JVM-based server application which is experiencing long (anywhere from 10s to multiple minutes) pauses. Below is an example of the safepoint statistics output:

         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
298.236: no vm operation  [ 385 0 0 ]  [ 0 0 21638 1 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
320.877: no vm operation  [ 336 0 0 ]  [ 0 0 1978 1 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
336.859: no vm operation  [ 309 1 2 ]  [ 0 0 6497 1 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
351.362: no vm operation  [ 294 0 0 ]  [ 0 0 32821 0 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
401.189: no vm operation  [ 296 3 2 ]  [ 0 0 3471 0 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
427.675: no vm operation  [ 303 0 0 ]  [ 0 0 10414 1 0 ]  0
         vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
446.098: no vm operation  [ 346 2 1 ]  [ 0 0 48768 1 0 ]  0

These are all printed back-to-back when running the application. The GC logs don't even indicate that these pauses were occurring:

2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
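For what it's worth, I've been pulling the long entries out of these logs with a small throwaway parser rather than reading them by eye. This is only a sketch, not part of the application; it assumes the column layout shown above (threads: total/initially_running/wait_to_block, times: spin/block/sync/cleanup/vmop, in ms) and a log path passed on the command line:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Throwaway scanner for -XX:+PrintSafepointStatistics output. Assumes each data
// line looks like:
//   298.236: no vm operation [ 385 0 0 ] [ 0 0 21638 1 0 ] 0
// i.e. timestamp, vm op name, [threads], [spin block sync cleanup vmop], page_trap_count.
public class SafepointScan {
    private static final Pattern ENTRY = Pattern.compile(
            "(\\d+\\.\\d+):\\s+(.+?)\\s+\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]" +
            "\\s+\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]");

    public static void main(String[] args) throws IOException {
        long thresholdMs = 10_000;  // matches the -XX:SafepointTimeoutDelay we run with (flags below)
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = ENTRY.matcher(line);
            if (!m.find()) {
                continue;  // header lines, GC lines, etc.
            }
            long syncMs = Long.parseLong(m.group(8));  // groups 6..10 = spin block sync cleanup vmop
            if (syncMs >= thresholdMs) {
                System.out.printf("t=%ss op='%s' sync=%d ms%n", m.group(1), m.group(2), syncMs);
            }
        }
    }
}

On the snippet above, that flags the 21638, 32821, 10414, and 48768 ms entries.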
During the time period of roughly 270-494 seconds the application pretty much stopped serving all IO requests and did not respond to any jstack operations.

Hardware information:
- OS: RHEL 7.8 (kernel version 3.10.0-1127)
- JVM: Oracle JDK 8u161
- Machine specs: 72 cores, 754 GiB RAM

About 650 GiB of the memory is allocated to a ramdisk; the JVM application (as you will see below) has a max heap of 32 GiB and direct memory of 24 GiB. This is the only application running on this server aside from a metrics collection daemon.

The JVM has the following flags specified:

-Xmx32g -Xms32g -XX:MaxDirectMemorySize=24g -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=10000 -XX:-UseBiasedLocking -Xloggc:<path>/logs/gc_worker.log -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=7093 -Dcom.sun.management.jmxremote.rmi.port=7093 -Djava.net.preferIPv4Stack=true

I already read through most of the discussion at these links regarding somewhat similar behavior:
- https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
- https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
- https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
- https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
- http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
- http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e

The application itself does a lot of memory-mapped file IO against spinning disks. Each mmap is usually just a few MiB, spread across 12 enterprise-class disk drives or a ramdisk. However, we've previously been able to max out 10 Gbit connections with the same application on much smaller machines (8-16 cores, 64 GiB RAM) without any issue. There are 144 threads doing the mmapped IO at any given time on this hardware.
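To give a concrete picture of that IO pattern, here is an illustrative sketch only (not our actual code; the path, region size, and buffer handling are made up for the example). Each worker thread essentially maps a small window of a file and reads/writes it through the mapping:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: roughly the shape of the per-thread mmapped IO described
// above (a few MiB per mapping, against either the spinning disks or the ramdisk).
public class MappedIoWorker implements Runnable {
    private final Path file;      // hypothetical data file on one of the 12 disks or the ramdisk
    private final long offset;    // where this worker's region starts
    private final int regionSize; // a few MiB per mapping

    MappedIoWorker(Path file, long offset, int regionSize) {
        this.file = file;
        this.offset = offset;
        this.regionSize = regionSize;
    }

    @Override
    public void run() {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a small window of the file; page faults on this buffer are the kind of
            // thing that can delay a thread from reaching a safepoint while it waits on disk IO.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, offset, regionSize);
            for (int i = 0; i < buf.limit(); i += 4096) {
                buf.put(i, (byte) (buf.get(i) + 1));  // touch each page: read-modify-write
            }
            buf.force();  // flush dirty pages back to the file
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // In the real application, 144 such workers run concurrently.
        new MappedIoWorker(Paths.get("/data/disk01/segment.bin"), 0, 4 * 1024 * 1024).run();
    }
}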
One of the points that stands out the most in this situation is that -XX:+SafepointTimeout doesn't actually print any information about threads failing to come to a safepoint. From my understanding, any time I get a safepoint statistics entry where the "sync" time is larger than 10000, I should get output on stdout listing the names of the threads which failed to reach the safepoint. In my logs, there is no such output.

Any ideas for debugging this are greatly appreciated. One thing I have not yet tried is using perf and kernel uprobes to capture the stacks (as suggested here <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>). The reason I have not tried this is that this software actually runs on a cluster of ~90 nodes (all the same hardware and software stack) and the occurrence of the pauses is seemingly random across all of the nodes. Setting up the uprobes and collecting the data will be a complex task which I haven't had the time for yet.

I appreciate any ideas anyone has to offer.

Thanks,
Zac