Have you looked at page faults? What happens when I/O to an mmap'ed file triggers a page fault? How can that affect time-to-safepoint?
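To make that failure mode concrete, here is a minimal sketch of the access pattern in question (the class and file are illustrative, not from the application). A get() on a MappedByteBuffer whose page is not resident takes a page fault inside compiled Java code; the thread cannot reach its safepoint poll until the kernel finishes the disk read, so one slow fault extends the sync phase for every other thread.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapFault {
    // Creates a one-byte temp file, maps it, and reads it back.
    static byte demo() {
        try {
            Path p = Files.createTempFile("mmap-demo", ".bin");
            Files.write(p, new byte[]{42});
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 1);
                // If this page is cold, the get() below takes a page fault in
                // Java code; the JVM cannot safepoint this thread mid-fault,
                // and on a busy spinning disk that fault can take a long time.
                return buf.get(0);
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```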
-- Igor

On Thu, Jun 4, 2020, at 8:47 PM, Zac Blanco wrote:
> Hi All,
>
> I'm hoping someone might be able to point me in the right direction with this or offer some ideas.
> I've done a lot of research but haven't found any other threads with similar behavior. The closest
> one I could find was this thread
> <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>,
> which never seems to have been solved.
>
> I've been working on this issue for a few weeks. We have a JVM-based server application which
> is experiencing long (anywhere from 10s to multiple minutes) pauses. Below is an example of
> the safepoint statistics output:
>
>          vmop    [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
> 298.236: no vm operation  [ 385 0 0 ]  [ 0 0 21638 1 0 ]  0
> 320.877: no vm operation  [ 336 0 0 ]  [ 0 0 1978 1 0 ]   0
> 336.859: no vm operation  [ 309 1 2 ]  [ 0 0 6497 1 0 ]   0
> 351.362: no vm operation  [ 294 0 0 ]  [ 0 0 32821 0 0 ]  0
> 401.189: no vm operation  [ 296 3 2 ]  [ 0 0 3471 0 0 ]   0
> 427.675: no vm operation  [ 303 0 0 ]  [ 0 0 10414 1 0 ]  0
> 446.098: no vm operation  [ 346 2 1 ]  [ 0 0 48768 1 0 ]  0
>
> These are all printed back-to-back when running the application. The GC logs don't even
> indicate that this pause was occurring.
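In output like the above, the "sync" column (third number in the second bracket, in milliseconds) is the one to watch: it is the time spent waiting for all threads to reach the safepoint. A throwaway helper for grepping the worst offenders out of these logs might look like this; the class is hypothetical, written against the exact line layout shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafepointSync {
    // Matches the trailing "[ spin block sync cleanup vmop ] page_trap_count"
    // portion of a -XX:+PrintSafepointStatistics data line.
    private static final Pattern TIMES = Pattern.compile(
        "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s*\\d+\\s*$");

    // Returns the sync time in milliseconds, or -1 if the line doesn't match.
    static long syncMillis(String line) {
        Matcher m = TIMES.matcher(line);
        return m.find() ? Long.parseLong(m.group(3)) : -1;
    }

    public static void main(String[] args) {
        String sample = "298.236: no vm operation  [ 385 0 0 ]  [ 0 0 21638 1 0 ]  0";
        System.out.println(syncMillis(sample)); // prints 21638
    }
}
```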
>
> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
>
> During the time period ~270-494 the application pretty much stopped serving all IO
> requests and did not respond to any jstack operations.
>
> Hardware information:
> OS: RHEL 7.8 (kernel version 3.10.0-1127)
> JVM: Oracle JDK 8u161
> Machine specs: 72 cores, 754GiB RAM
>
> About 650 GiB of the memory is allocated to a ramdisk; the JVM application (as you
> will see below) has a max heap of 32GiB and direct memory of 24GiB. This is the
> only application running on this server aside from a metrics collection daemon.
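Note the shape of the first and fifth entries above: nearly the entire stop is "Stopping threads took", which points at time-to-safepoint rather than GC work. A small, hypothetical filter over -XX:+PrintGCApplicationStoppedTime lines makes such entries easy to pull out of a large log:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TtspFilter {
    private static final Pattern P = Pattern.compile(
        "stopped: ([0-9.]+) seconds, Stopping threads took: ([0-9.]+) seconds");

    // True when more than 90% of the stop was spent just reaching the
    // safepoint (a time-to-safepoint problem, not a GC problem).
    static boolean ttspDominated(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) return false;
        double total = Double.parseDouble(m.group(1));
        double stopping = Double.parseDouble(m.group(2));
        return total > 0 && stopping / total > 0.9;
    }

    public static void main(String[] args) {
        System.out.println(ttspDominated(
            "268.104: Total time for which application threads were stopped: "
            + "1.5127234 seconds, Stopping threads took: 1.5089316 seconds")); // prints true
    }
}
```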
>
> The JVM has the following flags specified:
>
> -Xmx32g
> -Xms32g
> -XX:MaxDirectMemorySize=24g
> -XX:+PrintGC
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+SafepointTimeout
> -XX:SafepointTimeoutDelay=10000
> -XX:-UseBiasedLocking
> -Xloggc:<path>/logs/gc_worker.log
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=7093
> -Dcom.sun.management.jmxremote.rmi.port=7093
> -Djava.net.preferIPv4Stack=true
>
> I already read through most of the discussion at these links regarding somewhat
> similar behavior:
> - https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
> - http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e
>
> The application itself does a lot of mapped-file IO against spinning disks. Each mmap
> is usually just a few MiB, spread across 12 enterprise-class disk drives or a ramdisk. However,
> we've previously been able to max out 10Gbit connections with the same application on
> much smaller machines (8-16 cores, 64GiB RAM) without any issue. There are 144
> threads doing the mmapped IO at any given time on this hardware.
>
> One of the points that stands out the most in this situation is that
> -XX:+SafepointTimeout doesn't actually print any information about threads
> failing to come to a safepoint.
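Given those 144 mmapping threads, one mitigation sometimes worth weighing (a suggestion, not something the application does today): a thread blocked inside a native read() is already considered safepoint-safe by the JVM, whereas a page fault taken on a mapped buffer in Java code is not. Replacing mmap on cold paths with pread-style explicit reads trades an extra copy for a bounded time-to-safepoint. A minimal sketch, with hypothetical names:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ExplicitRead {
    // Reads `len` bytes at `pos` with a positional (pread-style) call instead
    // of a mapped buffer. If the disk stalls, the thread blocks in native
    // code, where the JVM does not need to wait for it to reach a safepoint.
    static ByteBuffer readAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position()); // does not move channel position
            if (n < 0) break; // EOF
        }
        buf.flip();
        return buf;
    }

    // Self-contained demo: write {1,2,3,4}, read two bytes at offset 1.
    static String demo() {
        try {
            Path p = Files.createTempFile("read-demo", ".bin");
            Files.write(p, new byte[]{1, 2, 3, 4});
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                ByteBuffer b = readAt(ch, 1, 2);
                return "" + b.get() + b.get();
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 23
    }
}
```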
> From my understanding, any time I get a safepoint statistics log where the
> "sync" time is larger than 10000, I should get output on stdout listing the
> names of the threads which failed to come to a safepoint. In my logs, there
> is no such output.
>
> Any ideas for debugging this are greatly appreciated. One thing I have not yet
> tried is using perf and kernel uprobes to capture the stacks (as suggested here
> <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>).
> The reason I have not tried this is that the software actually runs on a cluster
> of ~90 nodes (all the same hardware and software stack) and the occurrence of the
> pauses is seemingly random across all of the nodes. Setting up the uprobes and
> collecting the data will be a complex task which I haven't had the time for yet.
>
> I appreciate any ideas anyone has to offer.
>
> Thanks,
> Zac
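For whenever someone does get time to try the uprobe route, a command sketch of the kind of thing involved. The mangled symbol name is an assumption for a JDK 8 libjvm.so (and requires an unstripped library or debuginfo); verify it with nm before probing, and adjust the library path for your install.

```shell
# Locate the JVM's shared library (path varies by install; adjust as needed).
LIBJVM=/usr/java/jdk1.8.0_161/jre/lib/amd64/server/libjvm.so

# Confirm the safepoint-begin symbol actually exists in this build.
nm "$LIBJVM" | grep SafepointSynchronize5begin

# Place a uprobe on SafepointSynchronize::begin and record stacks
# system-wide while waiting for a pause to occur.
perf probe -x "$LIBJVM" safepoint_begin='_ZN20SafepointSynchronize5beginEv'
perf record -e probe_libjvm:safepoint_begin -a -g -- sleep 600
perf script > safepoint-stacks.txt

# Remove the probe afterwards.
perf probe -d probe_libjvm:safepoint_begin
```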
