Thanks for the write-up; I'm sure the information will be useful to others 
in the future.


Mark

On Monday, 9 November 2020 at 03:56:29 UTC [email protected] wrote:

> I want to give a brief update and some closure to this thread.
>
> After a long period of investigation, we finally determined these long 
> pauses were NOT due to the JVM, but rather caused by the operating system. 
> The machines we were working with had vm.zone_reclaim_mode=1 across all 
> systems. We aren't sure where the configuration came from, but some places 
> around the internet suggested that if the kernel detects high memory access 
> penalties between NUMA domains, it may enable this setting. A kernel patch 
> from 2014 suggests that this should be disabled by default though: 
> https://lore.kernel.org/patchwork/patch/454592/ . I think our 
> application's particular use case exacerbated the problems that zone 
> reclaim brings, due to the high amount of IO performed on the ramdisk and 
> the churn caused by the application's data eviction policies. This kernel 
> setting is not new, but it managed to evade us in our searches across the 
> internet.
>
> Either way, after setting vm.zone_reclaim_mode=0, all of these long pauses 
> disappeared and the systems started behaving normally. An uneventful end to 
> this thread, but at least we were able to find a culprit.
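>
> For anyone who wants to guard against this in the future: since we run JVM 
> services, here's a minimal Java 8 sketch (ours, not standard tooling; the 
> class name is made up for the example) of a startup sanity check that reads 
> the sysctl from /proc on a Linux host and warns if zone reclaim is enabled.
>
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ZoneReclaimCheck {
>     public static void main(String[] args) throws Exception {
>         // vm.zone_reclaim_mode is exposed at this path on Linux
>         Path p = Paths.get("/proc/sys/vm/zone_reclaim_mode");
>         if (Files.exists(p)) {
>             String value = new String(Files.readAllBytes(p)).trim();
>             if (!"0".equals(value)) {
>                 System.err.println("WARNING: vm.zone_reclaim_mode=" + value
>                         + "; NUMA zone reclaim can stall page-cache-heavy workloads");
>             }
>         }
>     }
> }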
>
> In case anyone is interested in some reading, I've collected a short list 
> of relevant web pages, articles, and discussions that led us to the issue 
> and helped us learn more about this setting and the kernel's behavior:
> - Kernel Documentation 
> <https://www.kernel.org/doc/Documentation/sysctl/vm.txt>
> - Relevant LinkedIn engineering blog post 
> <https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases>
> - NUMA Overview <https://queue.acm.org/detail.cfm?id=2513149>
> - MySQL Swap Insanity blog 
> <https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/>
> - Similar issue on another mailing list 
> <https://www.beowulf.org/pipermail/beowulf/2014-September/032829.html>
> - xCAT mailing list discussion 
> <https://sourceforge.net/p/xcat/mailman/message/32841877/>
>
> - Zac
>
> On Monday, June 22, 2020 at 1:10:12 PM UTC-7 [email protected] wrote:
>
>> Hi all,
>>
>> I appreciate the responses. We looked into switching JVMs. We will try to 
>> upgrade to the latest OpenJDK 8u252, but that won't happen for a while - 
>> and the Zing VM was unfortunately ruled out for now. If I can reproduce 
>> the issue in an isolated environment outside the cluster I may try the 
>> newer JVMs myself, but so far I haven't had luck isolating and 
>> reproducing it anywhere else.
>>
>> I brought up in the original post that the application performs IO via 
>> mmap because most of my research suggested that mmap can cause the JVM to 
>> pause for an extended period of time if there is a page fault. However, 
>> based on our system instrumentation, nothing suggests that we're fetching 
>> or writing pages out to disk for tens of seconds to multiple minutes at a 
>> time. The most data we mmap at a time is probably about 1MB or so.
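>>
>> For context, the access pattern is roughly the following sketch 
>> (illustrative only - the path and size are placeholders, not our real 
>> code). The point is that every get() on a MappedByteBuffer can page-fault, 
>> and a fault blocks the thread inside compiled code, which is exactly 
>> where time-to-safepoint can grow:
>>
>> import java.io.RandomAccessFile;
>> import java.nio.MappedByteBuffer;
>> import java.nio.channels.FileChannel;
>>
>> public class MmapReadSketch {
>>     public static void main(String[] args) throws Exception {
>>         try (RandomAccessFile raf = new RandomAccessFile("/data/segment.bin", "r");
>>              FileChannel ch = raf.getChannel()) {
>>             // Map ~1MB read-only; pages are faulted in lazily on first access
>>             MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 1 << 20);
>>             long sum = 0;
>>             while (buf.hasRemaining()) {
>>                 sum += buf.get(); // each first touch of a page may block on IO
>>             }
>>             System.out.println("checksum: " + sum);
>>         }
>>     }
>> }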
>>
>> According to our monitoring system, the disk utilization is relatively 
>> low. Here's an example of the system behavior we are seeing.
>>
>> These are some pauses we saw, with timestamps, in the last week (the 
>> columns mean the same as in the safepoint statistics output).
>>
>> Op: operation        timestamp                                    spin block sync      cleanup vmop page-trap-count
>> Op: no-vm-operation  JVM Time: 2020-06-18 05:47:14.972000+00:00;  0ms  0ms  1058341ms 12ms    0ms  0ms
>> Op: no-vm-operation  JVM Time: 2020-06-18 06:05:06.332000+00:00;  0ms  0ms  463072ms  15ms    0ms  0ms
>>
>>
>> Here's the monitoring system info for the ~1.5 hours around the time 
>> period of the events above (timestamps above are GMT, but monitoring is 
>> PST). The spikes in utilization and disk IO don't seem to have a strong 
>> correlation with the times that the JVM paused.
>> [image: adobe-newrelic-6-17-redacted.png]
>>
>>
>> If mmap were at the root of the issue, I would have expected more IO 
>> utilization. From my understanding, for an mmap access to take an 
>> extended period of time it would likely be evicting pages from the page 
>> cache and then reading the data into memory. That would take a long time 
>> if either (1) the only evictable pages are dirty, since those need to be 
>> written to disk first, (2) a read from disk itself took a long time, or 
>> some combination of (1) and (2). If the evictable pages are not dirty, 
>> then a page cache miss should take only about as long as reading the 
>> data from disk, because I would expect evicting a clean page to be a 
>> quick operation. Maybe I have some misconceptions about how the page 
>> cache functions, though. Either way, based on the system resource 
>> utilization, the system doesn't seem to be under any serious kind of 
>> stress during these time periods.
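>>
>> To test the "individual faults are slow" theory directly, one idea (a 
>> sketch under our own assumptions - the class name, page size, and 100ms 
>> threshold are all arbitrary choices) is to walk a mapped file one page at 
>> a time and time each first touch:
>>
>> import java.io.RandomAccessFile;
>> import java.nio.MappedByteBuffer;
>> import java.nio.channels.FileChannel;
>>
>> public class SlowFaultProbe {
>>     public static void main(String[] args) throws Exception {
>>         try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
>>              FileChannel ch = raf.getChannel()) {
>>             // A single mapping is limited to 2GB, so cap the length for the sketch
>>             long len = Math.min(ch.size(), Integer.MAX_VALUE);
>>             MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
>>             int pageSize = 4096; // typical x86-64 page size
>>             for (int off = 0; off < buf.limit(); off += pageSize) {
>>                 long t0 = System.nanoTime();
>>                 buf.get(off); // first touch of the page may fault
>>                 long micros = (System.nanoTime() - t0) / 1000;
>>                 if (micros > 100_000) { // flag any single access over 100ms
>>                     System.err.println("slow fault at offset " + off + ": " + micros + "us");
>>                 }
>>             }
>>         }
>>     }
>> }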
>>
>> I am skeptical that it is due to mmap because of the observed system 
>> stats, but I can't come up with another culprit. One of the pauses above is 
>> 1058s - nearly 18 minutes. It's hard for me to reason about how a single 
>> (or even a few) mmap calls could take so long.
>>
>> This 1000s+ pause is definitely the longest I've seen. I wonder if it 
>> could somehow be attributed to faulty disks that take inexplicably long 
>> amounts of time to read data? The problem with that theory though is that 
>> we see multi-minute pauses pretty consistently across most, if not all, 
>> machines in the cluster. The probability that so many disks could behave in 
>> such similar ways seems low.
>>
>> Going to continue investigating the issue. My next steps for looking 
>> at this are going to include:
>> - Grabbing kernel and userspace stacks during a pause
>> - Gathering info on IO operations and latency with tools like iosnoop, 
>> iolatency, or bitesize from https://github.com/brendangregg/perf-tools
>> - Collecting and analyzing stats on syscall latencies
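>>
>> On top of the kernel-side tooling, a simple in-process pause detector (a 
>> jHiccup-style idea; this sketch, its names, and its thresholds are our 
>> own choices) should let us timestamp stalls precisely so they can be 
>> correlated with the IO traces:
>>
>> public class PauseDetector implements Runnable {
>>     private static final long INTERVAL_MS = 100;
>>     private static final long REPORT_OVER_MS = 1000; // arbitrary threshold
>>
>>     @Override
>>     public void run() {
>>         long last = System.nanoTime();
>>         while (true) {
>>             try { Thread.sleep(INTERVAL_MS); } catch (InterruptedException e) { return; }
>>             long now = System.nanoTime();
>>             long gapMs = (now - last) / 1_000_000L;
>>             // If the gap greatly exceeds the sleep, the whole process stalled
>>             if (gapMs > INTERVAL_MS + REPORT_OVER_MS) {
>>                 System.err.println(new java.util.Date() + " pause of ~"
>>                         + (gapMs - INTERVAL_MS) + "ms detected");
>>             }
>>             last = now;
>>         }
>>     }
>>
>>     public static void main(String[] args) throws Exception {
>>         Thread t = new Thread(new PauseDetector(), "pause-detector");
>>         t.setDaemon(true);
>>         t.start();
>>         Thread.sleep(3_600_000); // stand-in for real application work
>>     }
>> }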
>>
>> Thanks again for the suggestions,
>> Zac
>>
>> On Monday, June 8, 2020 at 10:13:21 AM UTC-7 Gil Tene wrote:
>>
>>> Here is an idea for a quick way to analyze this. Zing has a built-in 
>>> TTSP profiler (since we pretty much coined the term "TTSP"). You could 
>>> just download the (free) Zing trial version, run your application on it, 
>>> see if the TTSP behavior is still there, and potentially get some quick 
>>> clarity about what is causing it.
>>>
>>> Zing will, by default, print out "Detected TTSP" warnings to stderr when 
>>> it sees any thread experience a TTSP above a detection threshold (the 
>>> warning threshold defaults to 1 second, but can be set with 
>>> -XX:SafepointProfilerThresholdMS=<val>). You don't have to wait for a real 
>>> pause to hit you to find these. Zing runs all threads through a background 
>>> checkpoint every 5 seconds, so code paths with long TTSPs have a hard time 
>>> hiding from the detector. Each thread crosses the checkpoint individually 
>>> at its next safepoint polling opportunity, with no actual pause, and no 
>>> actual delay in execution. The only time a thread will bear a real cost and 
>>> interact with IPC is when its TTSP is so long that it crosses the 
>>> threshold, in which case signals are used to collect context and report it.
>>>
>>> For an example of the kind of output you would see when, e.g., mapped 
>>> file page faults are the thing causing long TTSPs, see below. Note the 
>>> clear finger pointing at org.apache.lucene.store.ByteBufferIndexInput.readByte() 
>>> as the culprit in this example.
>>>
>>> An explanation of the timings in the example:
>>> - The detector fired after not seeing this thread check in for one 
>>> second (started at 41.768, sent the signal at 42.768)
>>> - The signal sent to collect the context took an additional 748msec to 
>>> respond [likely because the read was held on a page fault].
>>> - The next safepoint opportunity, when it was eventually reached 1.748 
>>> seconds after the checkpoint attempt, was 
>>> at org.apache.lucene.store.DataInput.readVInt(), but it was actually held 
>>> from reaching that point by a mapped byte buffer read in 
>>> org.apache.lucene.store.ByteBufferIndexInput.readByte() likely hitting a 
>>> page fault that took 1.748 seconds to complete...
>>>
>>> -- Gil.
>>>
>>> Detected TTSP issue: start: 41.768 wait: 1748.384
>>> Dumping stack for thread 0x0000440019600000
>>> "elasticsearch[localhost][search][T#49]" id: 171575 prio: 5 os_prio: 0 
>>> sched: SCHED_OTHER allowed_cpus: 000000000ffffffffffffffff
>>> lock_release: 43.516 last_cpu: 9 cpu_time: 25
>>>   0 0x00007fc63245f5bc SafepointProfilerBuf::record_sync_stack(
>>> JavaThread*)
>>>   1 0x00007fc6324817e5 JavaThread::poll_at_safepoint_static(JavaThread*)
>>>   2 0x000000003000eae9 StubRoutines::safepoint_handler
>>>   3 0x00000000300dfc73 org.apache.lucene.store.DataInput.readVInt()I
>>>   4 0x0000000031fa91df org.apache.lucene.codecs.blocktree.
>>> SegmentTermsEnumFrame.loadBlock()V
>>>   5 0x0000000031f99664 org.apache.lucene.codecs.blocktree.
>>> SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
>>>   6 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/
>>> apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)
>>> Lorg/apache/lucene/index/TermContext;
>>>   7 0x00000000319a40f6 interpreter
>>>   8 0x00000000319a3cba interpreter
>>>
>>>
>>> signal_sent: 42.768 signal_responded: 43.516 state: D wchan: 
>>> wait_on_page_bit_killable last_cpu: 9 cpu_time: 25
>>>   0 0x00000000300926f2 org.apache.lucene.store.ByteBufferIndexInput.
>>> readByte()B
>>>   1 0x00000000300dfacd org.apache.lucene.store.DataInput.readVInt()I
>>>   2 0x0000000031fa9119 org.apache.lucene.codecs.blocktree.
>>> SegmentTermsEnumFrame.loadBlock()V
>>>   3 0x0000000031f99664 org.apache.lucene.codecs.blocktree.
>>> SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
>>>   4 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/
>>> apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)
>>> Lorg/apache/lucene/index/TermContext;
>>>   5 0x00000000319a40f6 interpreter
>>>   6 0x00000000319a3cba interpreter
>>>   7 0x00000000319a3cba interpreter
>>>   8 0x00000000319a3cba interpreter
>>>
>>>
>>> On Thursday, June 4, 2020 at 2:07:10 PM UTC-7, Zac Blanco wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm hoping someone might be able to point me in the right direction 
>>>> with this or offer some ideas.
>>>> I've done a lot of research but haven't found any other threads with 
>>>> similar behavior. The closest
>>>> one I could find was this 
>>>> <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>, 
>>>> which it seems was never actually solved.
>>>>
>>>> I've been working on this issue for a few weeks. We have a JVM-based 
>>>> server application which
>>>> is experiencing long (anywhere from 10s to multiple minutes) pauses. 
>>>> Below is an example of
>>>> the safepoint statistics output:
>>>>
>>>>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
>>>> 298.236: no vm operation                  [     385          0              0    ]      [     0     0 21638     1     0    ]  0
>>>> 320.877: no vm operation                  [     336          0              0    ]      [     0     0  1978     1     0    ]  0
>>>> 336.859: no vm operation                  [     309          1              2    ]      [     0     0  6497     1     0    ]  0
>>>> 351.362: no vm operation                  [     294          0              0    ]      [     0     0 32821     0     0    ]  0
>>>> 401.189: no vm operation                  [     296          3              2    ]      [     0     0  3471     0     0    ]  0
>>>> 427.675: no vm operation                  [     303          0              0    ]      [     0     0 10414     1     0    ]  0
>>>> 446.098: no vm operation                  [     346          2              1    ]      [     0     0 48768     1     0    ]  0
>>>>
>>>>
>>>> These are all printed back-to-back when running the application. The GC 
>>>> logs don't even
>>>> indicate that these pauses were occurring.
>>>>
>>>> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
>>>> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
>>>> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
>>>> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
>>>> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
>>>> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
>>>> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
>>>> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
>>>> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
>>>> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
>>>> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
>>>> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
>>>> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
>>>> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
>>>> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
>>>> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
>>>>
>>>> During the time period ~270-494 the application pretty much stopped 
>>>> serving all IO
>>>> requests and did not respond to any jstack operations.
>>>>
>>>> System information:
>>>> OS: RHEL 7.8 (kernel version 3.10.0-1127)
>>>> JVM: Oracle JDK 8u161
>>>> Machine specs: 72 cores, 754GiB RAM.
>>>>
>>>> About 650GiB of the memory is allocated to a ramdisk; the JVM 
>>>> application (as you
>>>> will see below) has a max heap of 32GiB and direct memory of 24GiB. 
>>>> This is the
>>>> only application running on this server aside from a metrics collection 
>>>> daemon.
>>>>
>>>> The JVM has the following flags specified:
>>>>
>>>> -Xmx32g
>>>> -Xms32g
>>>> -XX:MaxDirectMemorySize=24g
>>>> -XX:+PrintGC
>>>> -XX:+PrintGCDateStamps
>>>> -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps
>>>> -XX:+PrintGCApplicationStoppedTime
>>>> -XX:+PrintSafepointStatistics
>>>> -XX:PrintSafepointStatisticsCount=1
>>>> -XX:+SafepointTimeout
>>>> -XX:SafepointTimeoutDelay=10000
>>>> -XX:-UseBiasedLocking
>>>> -Xloggc:<path>/logs/gc_worker.log
>>>> -Dcom.sun.management.jmxremote
>>>> -Dcom.sun.management.jmxremote.local.only=false
>>>> -Dcom.sun.management.jmxremote.port=7093
>>>> -Dcom.sun.management.jmxremote.rmi.port=7093
>>>> -Djava.net.preferIPv4Stack=true
>>>>
>>>>
>>>> I already read through most of the discussion at these links regarding 
>>>> somewhat
>>>> similar behavior:
>>>> - https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
>>>> - https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
>>>> - https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
>>>> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
>>>> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
>>>> - http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%[email protected]%3e
>>>>
>>>> The application itself does a lot of mapped file IO against spinning 
>>>> disks. Each mmap
>>>> is usually just a few MiB across 12 enterprise-class disk drives or a 
>>>> ramdisk. However,
>>>> previously we've been able to max out 10Gbit connections with the same 
>>>> application on
>>>> much smaller machines (8-16 cores, 64GiB RAM) without any issue. There are 
>>>> 144
>>>> threads doing the mmapped IO at any given time on this hardware.
>>>>
>>>> One of the points that stands out the most in this situation is that the
>>>> -XX:+SafepointTimeout flag doesn't actually print any information about 
>>>> threads failing to come
>>>> to a safepoint. From my understanding, any time I get a safepoint 
>>>> statistics log where the
>>>> "sync" time is larger than 10000ms, I should be getting output to 
>>>> stdout with the names
>>>> of the threads that failed to come to a safepoint. In my logs, there is 
>>>> no such output.
>>>>
>>>> Any ideas for debugging this are greatly appreciated. One thing I have not 
>>>> yet tried is using perf
>>>> and kernel uprobes to capture the stacks (as suggested here 
>>>> <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>).
>>>>  The reason I have not tried this
>>>> is that this software actually runs on a cluster of ~90 nodes (all the 
>>>> same hardware and
>>>> software stack) and the occurrence of the pauses is seemingly random across 
>>>> all of the nodes.
>>>> Setting up the uprobes and collecting the data will be a complex task 
>>>> which I haven't had the
>>>> time for yet.
>>>>  
>>>> Appreciate any ideas anyone has to offer.
>>>>
>>>> Thanks,
>>>> Zac
>>>>
>>>>
