Have you looked at page faults? What happens when I/O to an mmap'ed file triggers a page fault? How can that affect time-to-safepoint?
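To make that failure mode concrete, here is a minimal sketch of the access pattern in question (the class and file are illustrative, not from the application). A get() on a MappedByteBuffer whose page is not resident takes a page fault inside compiled Java code; the thread cannot reach its safepoint poll until the kernel finishes the disk read, so one slow fault extends the sync phase for every other thread.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapFault {
    // Creates a one-byte temp file, maps it, and reads it back.
    static byte demo() {
        try {
            Path p = Files.createTempFile("mmap-demo", ".bin");
            Files.write(p, new byte[]{42});
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 1);
                // If this page is cold, the get() below takes a page fault in
                // Java code; the JVM cannot safepoint this thread mid-fault,
                // and on a busy spinning disk that fault can take a long time.
                return buf.get(0);
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```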
-- Igor

On Thu, Jun 4, 2020, at 8:47 PM, Zac Blanco wrote:
> Hi All,
>
> I'm hoping someone might be able to point me in the right direction with this or offer some ideas.
> I've done a lot of research but haven't found any other threads with similar behavior. The closest
> one I could find was this thread
> <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>,
> which never seems to have been solved.
>
> I've been working on this issue for a few weeks. We have a JVM-based server application which
> is experiencing long (anywhere from 10s to multiple minutes) pauses. Below is an example of
> the safepoint statistics output:
>
>          vmop    [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
> 298.236: no vm operation  [ 385 0 0 ]  [ 0 0 21638 1 0 ]  0
> 320.877: no vm operation  [ 336 0 0 ]  [ 0 0 1978 1 0 ]   0
> 336.859: no vm operation  [ 309 1 2 ]  [ 0 0 6497 1 0 ]   0
> 351.362: no vm operation  [ 294 0 0 ]  [ 0 0 32821 0 0 ]  0
> 401.189: no vm operation  [ 296 3 2 ]  [ 0 0 3471 0 0 ]   0
> 427.675: no vm operation  [ 303 0 0 ]  [ 0 0 10414 1 0 ]  0
> 446.098: no vm operation  [ 346 2 1 ]  [ 0 0 48768 1 0 ]  0
>
> These are all printed back-to-back when running the application. The GC logs don't even
> indicate that this pause was occurring.
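In output like the above, the "sync" column (third number in the second bracket, in milliseconds) is the one to watch: it is the time spent waiting for all threads to reach the safepoint. A throwaway helper for grepping the worst offenders out of these logs might look like this; the class is hypothetical, written against the exact line layout shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafepointSync {
    // Matches the trailing "[ spin block sync cleanup vmop ] page_trap_count"
    // portion of a -XX:+PrintSafepointStatistics data line.
    private static final Pattern TIMES = Pattern.compile(
        "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s*\\d+\\s*$");

    // Returns the sync time in milliseconds, or -1 if the line doesn't match.
    static long syncMillis(String line) {
        Matcher m = TIMES.matcher(line);
        return m.find() ? Long.parseLong(m.group(3)) : -1;
    }

    public static void main(String[] args) {
        String sample = "298.236: no vm operation  [ 385 0 0 ]  [ 0 0 21638 1 0 ]  0";
        System.out.println(syncMillis(sample)); // prints 21638
    }
}
```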
>
> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
>
> During the time period ~270-494 the application pretty much stopped serving all IO
> requests and did not respond to any jstack operations.
>
> Hardware information:
> OS: RHEL 7.8 (kernel version 3.10.0-1127)
> JVM: Oracle JDK 8u161
> Machine specs: 72 cores, 754GiB RAM
>
> About 650 GiB of the memory is allocated to a ramdisk; the JVM application (as you
> will see below) has a max heap of 32GiB and direct memory of 24GiB. This is the
> only application running on this server aside from a metrics collection daemon.
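Note the shape of the first and fifth entries above: nearly the entire stop is "Stopping threads took", which points at time-to-safepoint rather than GC work. A small, hypothetical filter over -XX:+PrintGCApplicationStoppedTime lines makes such entries easy to pull out of a large log:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TtspFilter {
    private static final Pattern P = Pattern.compile(
        "stopped: ([0-9.]+) seconds, Stopping threads took: ([0-9.]+) seconds");

    // True when more than 90% of the stop was spent just reaching the
    // safepoint (a time-to-safepoint problem, not a GC problem).
    static boolean ttspDominated(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) return false;
        double total = Double.parseDouble(m.group(1));
        double stopping = Double.parseDouble(m.group(2));
        return total > 0 && stopping / total > 0.9;
    }

    public static void main(String[] args) {
        System.out.println(ttspDominated(
            "268.104: Total time for which application threads were stopped: "
            + "1.5127234 seconds, Stopping threads took: 1.5089316 seconds")); // prints true
    }
}
```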
>
> The JVM has the following flags specified:
>
> -Xmx32g
> -Xms32g
> -XX:MaxDirectMemorySize=24g
> -XX:+PrintGC
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+SafepointTimeout
> -XX:SafepointTimeoutDelay=10000
> -XX:-UseBiasedLocking
> -Xloggc:<path>/logs/gc_worker.log
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=7093
> -Dcom.sun.management.jmxremote.rmi.port=7093
> -Djava.net.preferIPv4Stack=true
>
> I already read through most of the discussion at these links regarding somewhat
> similar behavior:
> - https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
> - http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e
>
> The application itself does a lot of mapped-file IO against spinning disks. Each mmap
> is usually just a few MiB, spread across 12 enterprise-class disk drives or a ramdisk. However,
> we've previously been able to max out 10Gbit connections with the same application on
> much smaller machines (8-16 cores, 64GiB RAM) without any issue. There are 144
> threads doing the mmapped IO at any given time on this hardware.
>
> One of the points that stands out the most in this situation is that
> -XX:+SafepointTimeout doesn't actually print any information about threads
> failing to come to a safepoint.
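Given those 144 mmapping threads, one mitigation sometimes worth weighing (a suggestion, not something the application does today): a thread blocked inside a native read() is already considered safepoint-safe by the JVM, whereas a page fault taken on a mapped buffer in Java code is not. Replacing mmap on cold paths with pread-style explicit reads trades an extra copy for a bounded time-to-safepoint. A minimal sketch, with hypothetical names:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ExplicitRead {
    // Reads `len` bytes at `pos` with a positional (pread-style) call instead
    // of a mapped buffer. If the disk stalls, the thread blocks in native
    // code, where the JVM does not need to wait for it to reach a safepoint.
    static ByteBuffer readAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position()); // does not move channel position
            if (n < 0) break; // EOF
        }
        buf.flip();
        return buf;
    }

    // Self-contained demo: write {1,2,3,4}, read two bytes at offset 1.
    static String demo() {
        try {
            Path p = Files.createTempFile("read-demo", ".bin");
            Files.write(p, new byte[]{1, 2, 3, 4});
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                ByteBuffer b = readAt(ch, 1, 2);
                return "" + b.get() + b.get();
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 23
    }
}
```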
> From my understanding, any time I get a safepoint statistics log where the
> "sync" time is larger than 10000, I should get output on stdout listing the
> names of the threads which failed to come to a safepoint. In my logs, there
> is no such output.
>
> Any ideas for debugging this are greatly appreciated. One thing I have not yet
> tried is using perf and kernel uprobes to capture the stacks (as suggested here
> <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>).
> The reason I have not tried this is that the software actually runs on a cluster
> of ~90 nodes (all the same hardware and software stack) and the occurrence of the
> pauses is seemingly random across all of the nodes. Setting up the uprobes and
> collecting the data will be a complex task which I haven't had the time for yet.
>
> I appreciate any ideas anyone has to offer.
>
> Thanks,
> Zac
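For whenever someone does get time to try the uprobe route, a command sketch of the kind of thing involved. The mangled symbol name is an assumption for a JDK 8 libjvm.so (and requires an unstripped library or debuginfo); verify it with nm before probing, and adjust the library path for your install.

```shell
# Locate the JVM's shared library (path varies by install; adjust as needed).
LIBJVM=/usr/java/jdk1.8.0_161/jre/lib/amd64/server/libjvm.so

# Confirm the safepoint-begin symbol actually exists in this build.
nm "$LIBJVM" | grep SafepointSynchronize5begin

# Place a uprobe on SafepointSynchronize::begin and record stacks
# system-wide while waiting for a pause to occur.
perf probe -x "$LIBJVM" safepoint_begin='_ZN20SafepointSynchronize5beginEv'
perf record -e probe_libjvm:safepoint_begin -a -g -- sleep 600
perf script > safepoint-stacks.txt

# Remove the probe afterwards.
perf probe -d probe_libjvm:safepoint_begin
```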
