> That (each process having its own copy) is surprising to me. Unless
> the mapping is such that private copies are required, I'd expect the
> processes to share the page cache entries.

I can't recreate this effect locally using FileChannel.map(); the
library in use in the application uses a slightly more exotic route to
get to mmap, so it could be a bug there; will investigate. I could also
have been imagining it.
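For reference, the distinction as it appears through FileChannel.map():
a READ_WRITE mapping should be backed by the shared page cache, while a
PRIVATE mapping is copy-on-write and would give each process its own
pages on first store. A minimal sketch of the two modes (file path and
mapping size are invented here):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappingModes
    {
        public static void main(final String[] args) throws Exception
        {
            try (RandomAccessFile file =
                     new RandomAccessFile("/tmp/mapmode-test.dat", "rw");
                 FileChannel channel = file.getChannel())
            {
                // READ_WRITE: pages are the shared page cache entries, so
                // processes mapping the same region share physical pages.
                MappedByteBuffer shared =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

                // PRIVATE: copy-on-write; the first store gives this
                // process a private copy of the page, which is the one
                // case where per-process copies would be expected.
                MappedByteBuffer cow =
                    channel.map(FileChannel.MapMode.PRIVATE, 0, 4096);

                shared.put(0, (byte) 1); // visible to other mappers
                cow.put(0, (byte) 1);    // private copy, never written back
            }
        }
    }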
> Is your pre-toucher thread a Java thread doing its pre-touching using
> mapped i/o in the same process? If so, then the pre-toucher thread
> itself will be a high TTSP causer. The trick is to do the pre-touch in
> a thread that is already at a safepoint (e.g. do your pre-touch using
> mapped i/o from within a JNI call, use another process, or do the
> retouch with non-mapped i/o).

Yes, just a Java thread in the same process; I hadn't considered that
it would also cause long TTSP, but of course it's just as likely (or
more likely) to be scheduled off due to a page fault. I could try using
pwrite via FileChannel.write() to do the pre-touching, but I think it
needs to perform a CAS (i.e. don't overwrite data that is already
present), so a JNI method would be the only way to go. Unless just
doing a FileChannel.position(writeLimit).read(buffer) would do the job?
Presumably that is enough to load the page into the cache, and
performing a write is unnecessary.
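Something like the following is what I have in mind for the read-based
retouch; class and method names are mine, and the assumption that a
one-byte positional read per page is enough to pull the page into the
cache is exactly what I'd need to verify:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public final class ReadBasedPreToucher
    {
        private static final int PAGE_SIZE = 4096;

        // Touch each page in [fromOffset, toOffset) with a positional
        // read (pread(2) under the hood). The read is an explicit i/o
        // call made while the thread is in native code, i.e. already at
        // a safepoint, so a stall here should not hold up
        // time-to-safepoint the way a mapped-memory page fault in
        // JIT-compiled code can. Assumes the file has already been
        // extended past toOffset. No write is performed, so there is
        // nothing to CAS against.
        public static void preTouch(final FileChannel channel,
                                    final long fromOffset,
                                    final long toOffset) throws IOException
        {
            final ByteBuffer scratch = ByteBuffer.allocateDirect(1);
            for (long offset = fromOffset;
                 offset < toOffset;
                 offset += PAGE_SIZE)
            {
                scratch.clear();
                channel.read(scratch, offset);
            }
        }
    }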
>> Cheers,
>>
>> Mark
>>
>> On Tuesday, 5 December 2017 10:53:17 UTC, Gil Tene wrote:
>>>
>>> Page faults in mapped file i/o and counted loops are certainly two
>>> common causes of long TTSP. But there are many other paths that
>>> *could* cause it as well in HotSpot. Without catching it and looking
>>> at the stack trace, it's hard to know which ones to blame. Once you
>>> knock out one cause, you'll see if there is another.
>>>
>>> In the specific stack trace you showed [assuming that trace was
>>> taken during a long TTSP], mapped file i/o is the most likely
>>> culprit. Your trace seems to be around making the page writable for
>>> the first time and updating the file time (which takes a lock), but
>>> even without needing the lock, the fault itself could end up waiting
>>> for the i/o to complete (read page from disk), and that (when Murphy
>>> pays you a visit) can end up waiting behind 100s of other i/o
>>> operations (e.g. when your i/o happens at the same time the kernel
>>> decided to flush some dirty pages in the cache), leading to TTSPs in
>>> the 100s of msec.
>>>
>>> As I'm sure you already know, one simple way to get around
>>> mapped-file-related TTSP is to not use mapped files. Explicit random
>>> i/o calls are always done while at a safepoint, so they can't cause
>>> high TTSPs.
>>>
>>> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>>>>
>>>> Hi Aleksey,
>>>> thanks for the response. The I/O is definitely one problem, but I
>>>> was trying to figure out whether it was contributing to the long
>>>> TTSP times, or whether I might have some code that was misbehaving
>>>> (e.g. NonCountedLoops).
>>>>
>>>> Your response aligns with my guesswork, so hopefully I just have
>>>> the one problem to solve ;)
>>>>
>>>> Cheers,
>>>>
>>>> Mark
>>>>
>>>> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
>>>>>
>>>>> On 12/05/2017 09:26 AM, Mark Price wrote:
>>>>> > I'm investigating some long time-to-safepoint pauses in
>>>>> > oracle/openjdk. The application in question is also suffering
>>>>> > from some fairly nasty I/O problems where latency-sensitive
>>>>> > threads are being descheduled in uninterruptible sleep state due
>>>>> > to needing a file-system lock.
>>>>> >
>>>>> > My question: can the JVM detect that a thread is in
>>>>> > signal/interrupt-handler code and thus treat it as though it is
>>>>> > at a safepoint (as I believe happens when a thread is in native
>>>>> > code via a JNI call)?
>>>>> >
>>>>> > For instance, given the stack trace below, will the JVM need to
>>>>> > wait for the thread to be scheduled back on to CPU in order to
>>>>> > come to a safepoint, or will it be treated as "in-native"?
>>>>> >
>>>>> >     7fff81714cd9 __schedule ([kernel.kallsyms])
>>>>> >     7fff817151e5 schedule ([kernel.kallsyms])
>>>>> >     7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms])
>>>>> >     7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms])
>>>>> >     7fff817172ad down_write ([kernel.kallsyms])
>>>>> >     7fffa0403dcf xfs_ilock ([kernel.kallsyms])
>>>>> >     7fffa04018fe xfs_vn_update_time ([kernel.kallsyms])
>>>>> >     7fff8122cc5d file_update_time ([kernel.kallsyms])
>>>>> >     7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms])
>>>>> >     7fff811ba935 do_page_mkwrite ([kernel.kallsyms])
>>>>> >     7fff811bda74 handle_pte_fault ([kernel.kallsyms])
>>>>> >     7fff811c041b handle_mm_fault ([kernel.kallsyms])
>>>>> >     7fff8106adbe __do_page_fault ([kernel.kallsyms])
>>>>> >     7fff8106b0c0 do_page_fault ([kernel.kallsyms])
>>>>> >     7fff8171af48 page_fault ([kernel.kallsyms])
>>>>> > ---- java stack trace ends here ----
>>>>>
>>>>> I am pretty sure an out-of-band page fault in a Java thread does
>>>>> not yield a safepoint, at least because safepoint polls happen at
>>>>> given locations in the generated code: we need the pointer map as
>>>>> part of the machine state, and that is generated by HotSpot (only)
>>>>> around the safepoint polls. Page faulting on random read/write
>>>>> insns does not have that luxury. Even if the JVM had intercepted
>>>>> that fault, there is not enough metadata to work on.
>>>>>
>>>>> The stack trace above seems to say you have page faulted and this
>>>>> incurred disk I/O? This is swapping, I think, and all performance
>>>>> bets are off at that point.
>>>>>
>>>>> Thanks,
>>>>> -Aleksey
