Thanks for the write-up; I'm sure the information will be useful to others in the future.
Mark

On Monday, 9 November 2020 at 03:56:29 UTC [email protected] wrote:

> I want to give a brief update and some closure to this thread.
>
> After a long period of investigation, we finally determined these long pauses were NOT due to the JVM, but rather caused by the operating system. The machines we were working with had vm.zone_reclaim_mode=1 across all systems. We aren't sure where the configuration came from, but some places around the internet suggested that if the kernel detects high memory access penalties between NUMA domains, it may enable this setting. A kernel patch from 2014 suggests that this should be disabled by default though: https://lore.kernel.org/patchwork/patch/454592/ . I think our application's particular use case exacerbated the problems that zone reclaim brings, due to the high amount of IO performed on the ramdisk and the churn rate caused by the application's data eviction policies. This kernel setting is not new, but it seemed to evade us in our search across the internet.
>
> Either way, after setting vm.zone_reclaim_mode=0, all of these long pauses disappeared and the systems started behaving normally. An uneventful end to this thread, but at least we were able to find a culprit.
>
> In case anyone is interested in some reading, I've collected a short list of relevant web pages and articles that led us to determining the issue, along with some discussion that helped us learn more about this setting and the kernel's behavior.
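The setting discussed above can also be checked programmatically. Below is a minimal Java sketch (not from the original thread; the class name and messages are illustrative, and the bit meanings are those documented for vm.zone_reclaim_mode in the kernel's sysctl/vm documentation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZoneReclaimCheck {
    // Interpret the sysctl value: 0 disables zone reclaim entirely;
    // the nonzero bits enable progressively more aggressive reclaim.
    static String describe(int mode) {
        if (mode == 0) {
            return "zone reclaim disabled";
        }
        StringBuilder sb = new StringBuilder("zone reclaim enabled:");
        if ((mode & 1) != 0) sb.append(" reclaim-on-allocation");
        if ((mode & 2) != 0) sb.append(" write-dirty-pages");
        if ((mode & 4) != 0) sb.append(" swap-pages");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path p = Paths.get("/proc/sys/vm/zone_reclaim_mode");
        if (Files.exists(p)) {
            int mode = Integer.parseInt(new String(Files.readAllBytes(p)).trim());
            System.out.println(describe(mode));
        } else {
            System.out.println("sysctl not available (not Linux?)");
        }
    }
}
```

Running this on one of the affected machines before the fix would have reported the reclaim-on-allocation bit set.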
> - Kernel Documentation <https://www.kernel.org/doc/Documentation/sysctl/vm.txt>
> - Relevant LinkedIn engineering blog post <https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases>
> - NUMA Overview <https://queue.acm.org/detail.cfm?id=2513149>
> - MySQL Swap Insanity blog <https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/>
> - Similar issue on another mailing list <https://www.beowulf.org/pipermail/beowulf/2014-September/032829.html>
> - xCAT mailing list discussion <https://sourceforge.net/p/xcat/mailman/message/32841877/>
>
> - Zac
>
> On Monday, June 22, 2020 at 1:10:12 PM UTC-7 [email protected] wrote:
>
>> Hi all,
>>
>> I appreciate the responses. We looked into switching JVMs. We will try to upgrade to the latest OpenJDK 8u252, but that won't be for a bit - and the Zing VM was ruled out for now, unfortunately. If I can reproduce the issue in an isolated environment outside the cluster I may try the newer JVMs myself, but so far I haven't had luck isolating and reproducing it anywhere else.
>>
>> I brought up that the application performs IO via mmap in the original post because, from most of my research, mmap can cause the JVM to pause for an extended period of time if there is a page fault. However, based on our system instrumentation, nothing seems to suggest that we're fetching or writing pages out to disk for tens of seconds to multiple minutes at a time. The most data we mmap at a time is probably about 1MB or so.
>>
>> According to our monitoring system the disk utilization is relatively low. Here's an example of the system behavior we are seeing.
>>
>> These are some pauses we saw in the last week, with timestamps (the columns represent the same as they would from the safepoint output).
>> Op: operation        timestamp                                    spin  block  sync       cleanup  vmop  page-trap-count
>> Op: no-vm-operation  JVM Time: 2020-06-18 05:47:14.972000+00:00;  0ms   0ms    1058341ms  12ms     0ms   0ms
>> Op: no-vm-operation  JVM Time: 2020-06-18 06:05:06.332000+00:00;  0ms   0ms    463072ms   15ms     0ms   0ms
>>
>> Here's the monitoring system info for the ~1.5 hours around the time period from the events above (timestamps above are GMT, but monitoring is PST). The spikes in utilization and disk IO don't seem to have a strong correlation with the times that the JVM paused.
>> [image: adobe-newrelic-6-17-redacted.png]
>>
>> If mmap is at the root of the issue I would have expected more IO utilization. From my understanding, for mmap to take an extended period of time it would likely be evicting pages from the page cache and then reading the data into memory. That would take a long period of time if either (1) the only pages that are evict-able are dirty, since those need to be written to disk, (2) a read from disk took a long amount of time, or (3) some combination of both (1) and (2). If the evict-able pages are not dirty, then a page cache miss should only take approximately the amount of time to read data from disk, because I would expect evicting a clean page in memory to be a quick operation. Maybe I have some misconceptions about how the page cache functions, though. Either way, based on the system resource utilization, it seems like the system is not under any serious kind of stress during these time periods.
>>
>> I am skeptical that it is due to mmap because of the observed system stats, but I can't come up with another culprit. One of the pauses above is 1058s - nearly 18 minutes. It's hard for me to reason about how a single (or even a few) mmap calls could take so long.
>>
>> This 1000s+ pause is definitely the highest I've seen.
>> I wonder if it could somehow be attributed to faulty disks which take inexplicably long amounts of time to read data? The problem with that theory, though, is that we see multi-minute pauses pretty consistently across most, if not all, machines in the cluster. The probability that so many disks could behave in such similar ways seems low.
>>
>> Going to continue investigating the issue. My next steps for looking at this are going to include:
>> - Grabbing kernel and userspace stacks during a pause
>> - Gathering info on IO operations and latency with tools like iosnoop, iolatency, or bitesize from https://github.com/brendangregg/perf-tools
>> - Collecting and analyzing stats on syscall latencies
>>
>> Thanks again for the suggestions,
>> Zac
>>
>> On Monday, June 8, 2020 at 10:13:21 AM UTC-7 Gil Tene wrote:
>>
>>> Here is an idea for a quick way to analyze this. Zing has a built-in TTSP profiler (since we pretty much coined the term "TTSP"). You could just download the (free) Zing trial version, run your application on it, see if the TTSP behavior is still there, and potentially get some quick clarity about what is causing it.
>>>
>>> Zing will, by default, print out "Detected TTSP" warnings to stderr when it sees any thread experience a TTSP above a detection threshold (the warning threshold defaults to 1 second, but can be set with -XX:SafepointProfilerThresholdMS=<val>). You don't have to wait for a real pause to hit you to find these. Zing runs all threads through a background checkpoint every 5 seconds, so code paths with long TTSPs have a hard time hiding from the detector. Each thread crosses the checkpoint individually at its next safepoint polling opportunity, with no actual pause and no actual delay in execution.
>>> The only time a thread will bear a real cost and interact with IPC is when its TTSP is so long that it crosses the threshold, in which case signals are used to collect context and report it.
>>>
>>> For an example of the kind of output you would see when e.g. mapped file page faults are the thing causing long TTSPs, see below. Note the clear finger pointing at org.apache.lucene.store.ByteBufferIndexInput.readByte() as the culprit in this example.
>>>
>>> An explanation of the timings in the example:
>>> - The detector fired after not seeing this thread check in for one second (it started at 41.768 and sent the signal at 42.768).
>>> - The signal sent to collect the context took an additional 748 msec to respond [likely because the read was held on a page fault].
>>> - The next safepoint opportunity, when it was eventually reached 1.748 seconds after the checkpoint attempt, was at org.apache.lucene.store.DataInput.readVInt(), but the thread was actually held from reaching that point by a mapped byte buffer read in org.apache.lucene.store.ByteBufferIndexInput.readByte(), likely hitting a page fault that took 1.748 seconds to complete...
>>>
>>> -- Gil.
>>>
>>> Detected TTSP issue: start: 41.768 wait: 1748.384
>>> Dumping stack for thread 0x0000440019600000 "elasticsearch[localhost][search][T#49]"
>>> id: 171575 prio: 5 os_prio: 0 sched: SCHED_OTHER allowed_cpus: 000000000ffffffffffffffff
>>> lock_release: 43.516 last_cpu: 9 cpu_time: 25
>>>  0 0x00007fc63245f5bc SafepointProfilerBuf::record_sync_stack(JavaThread*)
>>>  1 0x00007fc6324817e5 JavaThread::poll_at_safepoint_static(JavaThread*)
>>>  2 0x000000003000eae9 StubRoutines::safepoint_handler
>>>  3 0x00000000300dfc73 org.apache.lucene.store.DataInput.readVInt()I
>>>  4 0x0000000031fa91df org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V
>>>  5 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
>>>  6 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/lucene/index/TermContext;
>>>  7 0x00000000319a40f6 interpreter
>>>  8 0x00000000319a3cba interpreter
>>>
>>> signal_sent: 42.768 signal_responded: 43.516 state: D wchan: wait_on_page_bit_killable last_cpu: 9 cpu_time: 25
>>>  0 0x00000000300926f2 org.apache.lucene.store.ByteBufferIndexInput.readByte()B
>>>  1 0x00000000300dfacd org.apache.lucene.store.DataInput.readVInt()I
>>>  2 0x0000000031fa9119 org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V
>>>  3 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
>>>  4 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/lucene/index/TermContext;
>>>  5 0x00000000319a40f6 interpreter
>>>  6 0x00000000319a3cba interpreter
>>>  7 0x00000000319a3cba interpreter
>>>  8 0x00000000319a3cba interpreter
>>>
>>> On Thursday, June 4, 2020 at 2:07:10 PM UTC-7, Zac Blanco wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm hoping someone might be able to point me in the right direction with this or offer some ideas. I've done a lot of research but haven't found any other threads with similar behavior. The closest one I could find was this <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>, which never seems to have been solved.
>>>>
>>>> I've been working on this issue for a few weeks. We have a JVM-based server application which is experiencing long (anywhere from 10s to multiple minutes) pauses.
>>>> Below is an example of the safepoint statistics output:
>>>>
>>>>          vmop             [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>>>> 298.236: no vm operation  [ 385   0   0 ]                                   [ 0  0  21638  1  0 ]                 0
>>>> 320.877: no vm operation  [ 336   0   0 ]                                   [ 0  0   1978  1  0 ]                 0
>>>> 336.859: no vm operation  [ 309   1   2 ]                                   [ 0  0   6497  1  0 ]                 0
>>>> 351.362: no vm operation  [ 294   0   0 ]                                   [ 0  0  32821  0  0 ]                 0
>>>> 401.189: no vm operation  [ 296   3   2 ]                                   [ 0  0   3471  0  0 ]                 0
>>>> 427.675: no vm operation  [ 303   0   0 ]                                   [ 0  0  10414  1  0 ]                 0
>>>> 446.098: no vm operation  [ 346   2   1 ]                                   [ 0  0  48768  1  0 ]                 0
>>>>
>>>> These are all printed back-to-back when running the application. The GC logs don't even indicate that these pauses were occurring.
>>>> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
>>>> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
>>>> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
>>>> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
>>>> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
>>>> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
>>>> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
>>>> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
>>>> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
>>>> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
>>>> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
>>>> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
>>>> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
>>>> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
>>>> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
>>>> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
>>>>
>>>> During the time period ~270-494 the application pretty much stopped serving all IO requests and did not respond to any jstack operations.
>>>>
>>>> Hardware information:
>>>> OS: RHEL 7.8 (kernel version 3.10.0-1127)
>>>> JVM: Oracle JDK 8u161
>>>> Machine specs: 72 cores, 754GiB RAM
>>>>
>>>> About 650 GiB of the memory is allocated to a ramdisk, the JVM application (as you will see below) has a max heap of 32GiB and direct memory of 24GiB. This is the only application running on this server aside from a metrics collection daemon.
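The memory figures above imply fairly tight headroom once the ramdisk, heap, and direct memory are accounted for, which is exactly the kind of pressure under which NUMA-local reclaim behavior can bite. A back-of-the-envelope budget, using only the numbers quoted in the post (this arithmetic is illustrative and not part of the original thread):

```java
public class MemoryBudget {
    public static void main(String[] args) {
        int totalGiB = 754;     // machine RAM
        int ramdiskGiB = 650;   // tmpfs/ramdisk allocation
        int heapGiB = 32;       // -Xmx32g
        int directGiB = 24;     // -XX:MaxDirectMemorySize=24g

        // Whatever remains must cover metaspace, thread stacks,
        // the page cache, and the OS itself.
        int remaining = totalGiB - ramdiskGiB - heapGiB - directGiB;
        System.out.println(remaining + " GiB left over"); // prints "48 GiB left over"
    }
}
```

Split across NUMA nodes, that leftover is small enough that a node can look locally "full" even while the machine as a whole has free memory, which is the situation where zone reclaim kicks in.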
>>>> The JVM has the following flags specified:
>>>>
>>>> -Xmx32g
>>>> -Xms32g
>>>> -XX:MaxDirectMemorySize=24g
>>>> -XX:+PrintGC
>>>> -XX:+PrintGCDateStamps
>>>> -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps
>>>> -XX:+PrintGCApplicationStoppedTime
>>>> -XX:+PrintSafepointStatistics
>>>> -XX:PrintSafepointStatisticsCount=1
>>>> -XX:+SafepointTimeout
>>>> -XX:SafepointTimeoutDelay=10000
>>>> -XX:-UseBiasedLocking
>>>> -Xloggc:<path>/logs/gc_worker.log
>>>> -Dcom.sun.management.jmxremote
>>>> -Dcom.sun.management.jmxremote.local.only=false
>>>> -Dcom.sun.management.jmxremote.port=7093
>>>> -Dcom.sun.management.jmxremote.rmi.port=7093
>>>> -Djava.net.preferIPv4Stack=true
>>>>
>>>> I already read through most of the discussion at these links regarding somewhat similar behavior:
>>>> - https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
>>>> - https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
>>>> - https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
>>>> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
>>>> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
>>>> - http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%[email protected]%3e
>>>>
>>>> The application itself does a lot of mapped file IO against spinning disks. Each mmap is usually just a few MiB across 12 enterprise-class disk drives or a ramdisk. However, previously we've been able to max out 10Gbit connections with the same application on much smaller machines (8-16 cores, 64GiB RAM) without any issue. There are 144 threads doing the mmapped IO at any given time on this hardware.
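The mapped-file IO pattern described above can be sketched as follows. This is a self-contained illustration, not the application's actual code (class, method, and file names are hypothetical); the relevant point is that reads from a MappedByteBuffer are plain memory accesses, so a non-resident page stalls the thread inside the read, where it cannot reach a safepoint poll until the kernel satisfies the fault:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class MappedReadDemo {
    // Map a region of a file read-only and sum its bytes. Each buf.get(i)
    // may page-fault; during the fault the thread is stuck in kernel space
    // (state D, e.g. wait_on_page_bit_killable), which is what shows up as
    // long "sync" time in the safepoint statistics.
    static long sumMappedBytes(Path file, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, length);
            long sum = 0;
            for (int i = 0; i < length; i++) {
                sum += buf.get(i) & 0xFF;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        // ~1MB mapping, matching the sizes described in the thread.
        Path tmp = Files.createTempFile("mmap-demo", ".bin");
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 1);
        Files.write(tmp, data);
        System.out.println(sumMappedBytes(tmp, data.length)); // prints 1048576
        Files.delete(tmp);
    }
}
```

With 144 such threads, a single slow fault (e.g. one prolonged by zone reclaim) is enough to hold a global safepoint for every thread in the process.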
>>>> One of the points that stands out the most in this situation is that -XX:+SafepointTimeout doesn't actually print any information about threads failing to come to a safepoint. From my understanding, any time I get a safepoint statistics log where the "sync" time is larger than 10000, I should be getting output to stdout with the names of the threads which failed to come to a safepoint. In my logs, there is no such output.
>>>>
>>>> Any ideas for debugging this are greatly appreciated. One thing I have not yet tried is using perf and kernel uprobes to capture the stacks (as suggested here <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>). The reason I have not tried this is because this software actually runs on a cluster of ~90 nodes (all the same hardware and software stack) and the occurrence of the pauses is seemingly random across all of the nodes. Setting up the uprobes and collecting the data will be a complex task which I haven't had the time for yet.
>>>>
>>>> I appreciate any ideas anyone has to offer.
>>>>
>>>> Thanks,
>>>> Zac
