Hi Zac,

I would suggest using JFR to collect events about safepoints/VM
operations. For that you can upgrade to Zulu 8u212 or later, which
includes JDK Flight Recorder (JFR).
JFR is also bundled with Oracle JDK 8, but under a commercial license (if
you have a subscription, just enable it with
-XX:+UnlockCommercialFeatures).

You can find details about how to start Flight Recorder here:
https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/run.htm#JFRUH176

In JMC you can edit which events you are collecting and refine them to
get more information about safepoints/VM operations; you can also use JMC
to analyze the recordings.

The advantage is that you can leave it running on your cluster with barely
noticeable overhead.
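As a rough sketch, starting a recording looks something like this (the
PID, duration, and file names are illustrative, not from your setup):

```shell
# Oracle JDK 8: commercial features must be unlocked first.
# On Zulu 8u212+ drop -XX:+UnlockCommercialFeatures.
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:StartFlightRecording=duration=10m,filename=safepoints.jfr \
     -jar app.jar

# Or attach to an already-running JVM with jcmd:
jcmd <pid> JFR.start duration=10m filename=safepoints.jfr
jcmd <pid> JFR.check   # list active recordings and their state
```

Once the recording is dumped, open the .jfr file in JMC to look at the
VM Operations / safepoint events.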

Regards
Jean-Philippe


On Thu, Jun 4, 2020 at 11:07 PM Zac Blanco <[email protected]> wrote:

> Hi All,
>
> I'm hoping someone might be able to point me in the right direction with
> this or offer some ideas.
> I've done a lot of research but haven't found any other threads with
> similar behavior. The closest
> one I could find was this
> <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>
> which actually never seems like it was solved.
>
> I've been working on this issue for a few weeks. We have a JVM-based
> server application which
> is experiencing long (anywhere from 10s to multiple minutes) pauses. Below
> is an example of
> the safepoint statistics output:
>
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 298.236: no vm operation                  [     385          0
>    0    ]      [     0     0 21638     1     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 320.877: no vm operation                  [     336          0
>    0    ]      [     0     0  1978     1     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 336.859: no vm operation                  [     309          1
>    2    ]      [     0     0  6497     1     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 351.362: no vm operation                  [     294          0
>    0    ]      [     0     0 32821     0     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 401.189: no vm operation                  [     296          3
>    2    ]      [     0     0  3471     0     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 427.675: no vm operation                  [     303          0
>    0    ]      [     0     0 10414     1     0    ]  0
>          vmop                    [threads: total initially_running
> wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 446.098: no vm operation                  [     346          2
>    1    ]      [     0     0 48768     1     0    ]  0
>
>
> These are all printed back-to-back when running the application. The GC
> logs don't even
> indicate that these pauses were occurring.
>
> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application
> threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316
> seconds
> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application
> threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995
> seconds
> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [
> PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K),
> 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application
> threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499
> seconds
> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application
> threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942
> seconds
> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application
> threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684
> seconds
> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application
> threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394
> seconds
> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application
> threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109
> seconds
> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application
> threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694
> seconds
> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application
> threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628
> seconds
> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application
> threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727
> seconds
> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application
> threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175
> seconds
> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application
> threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815
> seconds
> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application
> threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102
> seconds
> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application
> threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464
> seconds
> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application
> threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538
> seconds
>
> During the time period ~270-494 the application pretty much stopped
> serving all IO
> requests and did not respond to any jstack operations
>
> Hardware information:
> OS: RHEL 7.8 (kernel version 3.10.0-1127)
> JVM: Oracle JDK 8u161
> Machine specs: 72 cores, 754GiB RAM.
>
> About 650 GiB of the memory is allocated to a ramdisk, the JVM application
> (as you
> will see below) has a max heap of 32GiB, and direct memory of 24GiB. This
> is the
> only application running on this server aside from a metrics collection
> daemon.
>
> The JVM has the following flags specified:
>
> -Xmx32g
> -Xms32g
> -XX:MaxDirectMemorySize=24g
> -XX:+PrintGC
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+SafepointTimeout
> -XX:SafepointTimeoutDelay=10000
> -XX:-UseBiasedLocking
> -Xloggc:<path>/logs/gc_worker.log
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=7093
> -Dcom.sun.management.jmxremote.rmi.port=7093
> -Djava.net.preferIPv4Stack=true
>
>
> I already read through most of the discussion at these links regarding 
> somewhat
> similar behavior:
> - 
> https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
> - 
> https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
> - 
> https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
> - 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e
>
> The application itself does a lot of mapped file IO against spinning disks. 
> Each mmap
> is usually just a few MiB across 12 enterprise-class disk drives or a 
> ramdisk. However,
> previously we've been able to max out 10Gbit connections with the same 
> application on
> much smaller machines (8-16 cores, 64GiB RAM) without any issue. There are 144
> threads doing the mmapped IO at any given time on this hardware.
>
> One of the points that stands out the most in this situation is that the
> -XX:+SafepointTimeout doesn't actually print any information about threads 
> failing to come
> to a safepoint. From my understanding, any time I get a safepoint statistics 
> log where the
> "sync" time is larger than 10000, then I should be getting output to stdout 
> of the thread
> names which failed to come to a safepoint. In my logs, there is no such 
> output.
>
> Any ideas for debugging this are greatly appreciated. One thing I have not yet 
> tried is using perf
> and kernel uprobes to capture the stacks (as suggested here 
> <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>).
>  The reason I have not tried this
> is because this software actually runs on a cluster of ~90 nodes (all the 
> same hardware and
> software stack) and the occurrence of the pauses is seemingly random across 
> all of the nodes.
> Setting up the uprobes and collecting the data will be a complex task which I 
> haven't had the
> time for yet.
>
> I appreciate any ideas anyone has to offer.
>
> Thanks,
> Zac
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/mechanical-sympathy/05345cdc-fd0b-4ae6-853b-ad8e140005b6o%40googlegroups.com
> <https://groups.google.com/d/msgid/mechanical-sympathy/05345cdc-fd0b-4ae6-853b-ad8e140005b6o%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
