On Tue, Dec 5, 2017 at 11:22 AM, Mark Price <[email protected]> wrote:

> Hi Kirk,
> thanks for the response.
>
> I've run the same workload with -XX:+UseCountedLoopSafepoints, and
> the problem persists, so I'm fairly certain it's not that (assuming the JVM
> is respecting the flags...). The application is GC-free, so the safepoints
> are only going to be coming from (non-exhaustive): guaranteed safepoints,
> JIT operation?, lock biasing, others?
>
> The environment is indeed toxic: multiple JVMs mmap files on the order of
> 1/4 system memory, performing either read or write operations and the
> page-cache flusher threads are struggling to cope (not surprisingly). I
> don't really want to go down the route of using madvise/mlock.
>
> Happily, this is a test environment, so I can be brave with profiling -
> what sort of instrumentation do you refer to - kernel jprobes or JVM
> hacking?
>

Hi Mark, Kirk,

This seems to be a common question which could do with a generic solution.
It can be done with kernel uprobe tracing, without changing the JVM. A
kernel profiler that can report both kernel and user stacks leaves far
less guesswork. To find out:
1) what provoked the safepoint:
   VMThread::execute : _ZN8VMThread7executeEP12VM_Operation
   We could trace the scheduler wake event at this point, but scheduler
   events can be very high frequency, so we avoid them where possible.
2) what is delaying the safepoint sync:
   SafepointSynchronize::block : _ZN20SafepointSynchronize5blockEP10JavaThread


The proof-of-concept OpenJDK safepoint profiler example below uses Linux
perf. async-profiler may be able to do a better job of reporting the
interpreted frames.


# Path to the JVM's libjvm.so (left unset here; point it at your install)
libjvm=
perf probe -d 'probe_libjvm:*'
for p in \
    _ZN8VMThread7executeEP12VM_Operation \
    _ZN20SafepointSynchronize16do_cleanup_tasksEv \
    _ZN20SafepointSynchronize5blockEP10JavaThread \
    _ZN20SafepointSynchronize5beginEv \
    _ZN2os23serialize_thread_statesEv \
    _ZN2os28make_polling_page_unreadableEv \
    _ZN20SafepointSynchronize3endEv \
    _ZN2os26make_polling_page_readableEv
do
    perf probe -x "$libjvm" "$p"
done
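As a sanity check that the probes landed on the intended functions, the mangled symbol names can be demangled with c++filt (assuming binutils is installed):

```shell
# Demangle the probe symbols to confirm they are the safepoint entry points.
echo _ZN20SafepointSynchronize5blockEP10JavaThread | c++filt
# SafepointSynchronize::block(JavaThread*)
echo _ZN8VMThread7executeEP12VM_Operation | c++filt
# VMThread::execute(VM_Operation*)
```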

# The java command under test (left unset here; fill in for your setup)
java_cmd=
perf record -e cpu-clock -e 'probe_libjvm:*' -g -- \
    taskset -c 1,2 $java_cmd -cp . \
    -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics \
    -XX:PrintSafepointStatisticsCount=1 -XX:+PreserveFramePointer \
    BogusPauseReportingDemo -t 100
BogusPauseReportingDemo started, will run for 100 msec
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
12.614: ParallelGCFailedAllocation       [      10          0          2    ]      [     0   217   218     0    34    ]  0
Total time for which application threads were stopped: 0.2530938 seconds, Stopping threads took: 0.2180568 seconds
BogusPauseReportingDemo done...
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
13.178: no vm operation                  [       6          0          0    ]      [     0     0     0     0     0    ]  0

Polling page always armed
ParallelGCFailedAllocation         2
    0 VM operations coalesced during safepoint
Maximum sync time    241 ms
Maximum vm operation time (except for Exit VM operation)     34 ms
[ perf record: Woken up 46 times to write data ]
[ perf record: Captured and wrote 2.574 MB perf.data (14328 samples) ]
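The -XX:+PrintGCApplicationStoppedTime lines can also be filtered for pauses dominated by time-to-safepoint rather than the operation itself. A rough awk sketch (field positions assume the exact log wording above; gc.log is a hypothetical capture of the JVM's stdout):

```shell
# Flag pauses where "Stopping threads" (TTSP) is more than half the total stop.
awk '/Stopping threads took/ {
    # fields: "... stopped: <total> seconds, Stopping threads took: <ttsp> seconds"
    total = $9 + 0; ttsp = $14 + 0
    if (ttsp > 0.5 * total)
        printf "TTSP-dominated pause: total %.1f ms, ttsp %.1f ms\n", total * 1000, ttsp * 1000
}' gc.log
```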




perf script | grep probe_libjvm

java 17997 [001] 2446326.288185: probe_libjvm:_ZN8VMThread7executeEP12VM_Operation: (7f78a2367ee0)
java 17859 [001] 2446326.288251: probe_libjvm:_ZN20SafepointSynchronize5beginEv: (7f78a21d2280)
java 17859 [001] 2446326.288261: probe_libjvm:_ZN2os23serialize_thread_statesEv: (7f78a21272a0)
java 17859 [001] 2446326.288352: probe_libjvm:_ZN2os28make_polling_page_unreadableEv: (7f78a2131260)
java 17997 [001] 2446326.288403: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
java 17856 [002] 2446326.366545: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
java 17998 [002] 2446326.506283: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
java 17859 [002] 2446326.506313: probe_libjvm:_ZN20SafepointSynchronize16do_cleanup_tasksEv: (7f78a21d20d0)
java 17859 [002] 2446326.541251: probe_libjvm:_ZN20SafepointSynchronize3endEv: (7f78a21d3670)
java 17859 [002] 2446326.541320: probe_libjvm:_ZN2os26make_polling_page_readableEv: (7f78a21312c0)
...
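The timestamps in the event stream above can be paired to compute each safepoint's begin-to-end duration. A rough awk sketch over the perf script output (field positions assume the default layout shown above):

```shell
# Pair SafepointSynchronize::begin/end probe hits and print the elapsed time.
perf script | awk '
    /probe_libjvm:_ZN20SafepointSynchronize5beginEv/ { t = $4; sub(/:$/, "", t); begin = t + 0 }
    /probe_libjvm:_ZN20SafepointSynchronize3endEv/   { t = $4; sub(/:$/, "", t)
        if (begin) printf "safepoint %.3f ms\n", (t - begin) * 1000; begin = 0 }
'
```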


java 17997 [001] 2446326.288185: probe_libjvm:_ZN8VMThread7executeEP12VM_Operation: (7f78a2367ee0)
                  b2fee0 VMThread::execute (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  ae96f3 TypeArrayKlass::allocate_common (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  681577 InterpreterRuntime::newarray (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
            7f788d0244c7 Interpreter (/tmp/perf-17855.map)
            7f788d0004e7 call_stub (/tmp/perf-17855.map)
                  6941e3 JavaCalls::call_helper (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691747 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691d27 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6d5864 thread_entry (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  acdfbb JavaThread::thread_main_inner (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  ace49f JavaThread::run (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  8f2f62 java_start (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                    773a start_thread (/usr/lib64/libpthread-2.24.so)


java 17997 [001] 2446326.288403: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
                  99ba70 SafepointSynchronize::block (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  b30354 VMThread::execute (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  91a3e8 ParallelScavengeHeap::mem_allocate (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  ae96f3 TypeArrayKlass::allocate_common (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  681577 InterpreterRuntime::newarray (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
            7f788d0244c7 Interpreter (/tmp/perf-17855.map)
            7f788d0004e7 call_stub (/tmp/perf-17855.map)
                  6941e3 JavaCalls::call_helper (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691747 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691d27 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6d5864 thread_entry (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  acdfbb JavaThread::thread_main_inner (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  ace49f JavaThread::run (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  8f2f62 java_start (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                    773a start_thread (/usr/lib64/libpthread-2.24.so)


java 17856 [002] 2446326.366545: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
                  99ba70 SafepointSynchronize::block (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6db9a0 JVM_Sleep (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
            7f788d0174d4 Interpreter (/tmp/perf-17855.map)
            7f788d007ffd Interpreter (/tmp/perf-17855.map)
            7f788d0004e7 call_stub (/tmp/perf-17855.map)
                  6941e3 JavaCalls::call_helper (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6b6e43 jni_invoke_static (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6b9566 jni_CallStaticVoidMethod (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                    3371 JavaMain (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/jli/libjli.so)
                    773a start_thread (/usr/lib64/libpthread-2.24.so)


java 17998 [002] 2446326.506283: probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
                  99ba70 SafepointSynchronize::block (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
            7f788d108c2a Ljava/lang/System;::arraycopy (/tmp/perf-17855.map)
            7f788d007ffd Interpreter (/tmp/perf-17855.map)
            7f788d0004e7 call_stub (/tmp/perf-17855.map)
                  6941e3 JavaCalls::call_helper (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691747 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  691d27 JavaCalls::call_virtual (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  6d5864 thread_entry (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  acdfbb JavaThread::thread_main_inner (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  ace49f JavaThread::run (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                  8f2f62 java_start (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
                    773a start_thread (/usr/lib64/libpthread-2.24.so)

Thanks
Alex


>
>
> Cheers,
>
> Mark
>
> On Tue, Dec 5, 2017 at 11:11 AM Kirk Pepperdine <[email protected]> wrote:
>
>> Hi,
>>
>>
>> On Dec 5, 2017, at 11:53 AM, Gil Tene <[email protected]> wrote:
>>
>> Page faults in mapped file i/o and counted loops are certainly two common
>> causes of long TTSP. But there are many other paths that *could* cause it
>> as well in HotSpot. Without catching it and looking at the stack trace,
>> it's hard to know which ones to blame. Once you knock out one cause, you'll
>> see if there is another.
>>
>> In the specific stack trace you showed [assuming that trace was taken
>> during a long TTSP], mapped file i/o is the most likely culprit. Your trace
>> seems to be around making the page write-able for the first time and
>> updating the file time (which takes a lock), but even without needing the
>> lock, the fault itself could end up waiting for the i/o to complete (read
>> page from disk), and that (when Murphy pays you a visit) can end up
>> waiting behind 100s other i/o operations (e.g. when your i/o happens at the
>> same time the kernel decided to flush some dirty pages in the cache),
>> leading to TTSPs in the 100s of msec.
>>
>> As I'm sure you already know, one simple way to get around mapped file
>> related TTSP is to not use mapped files. Explicit random i/o calls are
>> always done while at a safepoint, so they can't cause high TTSPs.
>>
>>
>> I guess another way to avoid the long TTSP is to not safe point.. ;-)
>>
>> Most of the long TTSP’s IME are either due to outside scheduling
>> interference or loops that are compiled to the point where they turn into
>> counted loops. Finding this using traditional tooling is impossible because
>> you need the safe point which has most likely been compiled out of the code
>> to get a sample to see why you’re waiting for a safe point. Using an async
>> profiler (like Honest Profiler) leaves you wanting for some correlation between
>> calls for safe point and the samples you’re looking at. So, I don’t really
>> have a good answer as to how to find these things. What I have done in the
>> past is used the GC log to get an idea of the toxicity level of the
>> environment you’re running in and then use that information to decide if I
>> need to look for internal or external causes. If the causes are internal
>> then it’s most likely a counted loop. Using JITWatch in this case you can
>> get a candidate list of counted loops to consider. After that I’ve binary
>> searched through that list. If the environment is toxic.. then you need to
>> clean up the environment. If you are brave, you add some instrumentation to
>> the JVM to sample for the last thread to reach the safe point and use that
>> in your prod environment.
>>
>> Kind regards,
>> Kirk
>>
>>
>> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>>>
>>> Hi Aleksey,
>>> thanks for the response. The I/O is definitely one problem, but I was
>>> trying to figure out whether it was contributing to the long TTSP times, or
>>> whether I might have some code that was misbehaving (e.g. NonCountedLoops).
>>>
>>> Your response aligns with my guesswork, so hopefully I just have the one
>>> problem to solve ;)
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
>>>>
>>>> On 12/05/2017 09:26 AM, Mark Price wrote:
>>>> > I'm investigating some long time-to-safepoint pauses in
>>>> oracle/openjdk. The application in question
>>>> > is also suffering from some fairly nasty I/O problems where
>>>> latency-sensitive threads are being
>>>> > descheduled in uninterruptible sleep state due to needing a
>>>> file-system lock.
>>>> >
>>>> > My question: can the JVM detect that a thread is in
>>>> signal/interrupt-handler code and thus treat it
>>>> > as though it is at a safepoint (as I believe happens when a thread is
>>>> in native code via a JNI call)?
>>>> >
>>>> > For instance, given the stack trace below, will the JVM need to wait
>>>> for the thread to be scheduled
>>>> > back on to CPU in order to come to a safepoint, or will it be treated
>>>> as "in-native"?
>>>> >
>>>> >         7fff81714cd9 __schedule ([kernel.kallsyms])
>>>> >         7fff817151e5 schedule ([kernel.kallsyms])
>>>> >         7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms])
>>>> >         7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms])
>>>> >         7fff817172ad down_write ([kernel.kallsyms])
>>>> >         7fffa0403dcf xfs_ilock ([kernel.kallsyms])
>>>> >         7fffa04018fe xfs_vn_update_time ([kernel.kallsyms])
>>>> >         7fff8122cc5d file_update_time ([kernel.kallsyms])
>>>> >         7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms])
>>>> >         7fff811ba935 do_page_mkwrite ([kernel.kallsyms])
>>>> >         7fff811bda74 handle_pte_fault ([kernel.kallsyms])
>>>> >         7fff811c041b handle_mm_fault ([kernel.kallsyms])
>>>> >         7fff8106adbe __do_page_fault ([kernel.kallsyms])
>>>> >         7fff8106b0c0 do_page_fault ([kernel.kallsyms])
>>>> >         7fff8171af48 page_fault ([kernel.kallsyms])
>>>> >         ---- java stack trace ends here ----
>>>>
>>>> I am pretty sure out-of-band page fault in Java thread does not yield a
>>>> safepoint. At least because
>>>> safepoint polls happen at given location in the generated code, because
>>>> we need the pointer map as
>>>> the part of the machine state, and that is generated by Hotspot (only)
>>>> around the safepoint polls.
>>>> Page faulting on random read/write insns does not have that luxury.
>>>> Even if JVM had intercepted that
>>>> fault, there is not enough metadata to work on.
>>>>
>>>> The stacktrace above seems to say you have page faulted and this
>>>> incurred disk I/O? This is
>>>> swapping, I think, and all performance bets are off at that point.
>>>>
>>>> Thanks,
>>>> -Aleksey
>>>>
>>>>
>> --
>>
>> You received this message because you are subscribed to the Google Groups
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>>
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
