On Tuesday, December 5, 2017 at 1:26:23 PM UTC+1, Mark Price wrote:
> That (each process having its own copy) is surprising to me. Unless the 
> mapping is such that private copies are required, I'd expect the processes to 
> share the page cache entries.
> 
> 
> I can't recreate this effect locally using FileChannel.map(); the library in 
> use in the application uses a slightly more exotic route to get to mmap, so 
> it could be a bug there; will investigate. I could also have been imagining 
> it.
> 
> Is your pre-toucher thread a Java thread doing its pre-touching using mapped 
> i/o in the same process? If so, then the pre-toucher thread itself will be a 
> high TTSP causer. The trick is to do the pre-touch in a thread that is 
> already at a safepoint (e.g. do your pre-touch using mapped i/o from within a 
> JNI call, use another process, or do the retouch with non-mapped i/o).
> 
> 
> Yes, just a Java thread in the same process; I hadn't considered that it 
> would also cause long TTSP, but of course it's just as likely (or more 
> likely) to be scheduled off due to a page fault. I could try using pwrite via 
> FileChannel.write() to do the pre-touching, but I think it needs to perform a 
> CAS (i.e. don't overwrite data that is already present), so a JNI method 
> would be the only way to go. Unless just doing a 
> FileChannel.position(writeLimit).read(buffer) would do the job? Presumably 
> that is enough to load the page into the cache and performing a write is 
> unnecessary.

This (non-mapped reading at the write limit) will work to eliminate the actual 
page I/O impact on TTSP, but the time-update path with the lock that you show 
in your initial stack trace will probably still hit you. I'd go either with a 
JNI CAS, or a forked-off mapped Java pretoucher running as a separate process 
(tell it what you want touched via its stdin). Not sure which one is uglier. 
The pure Java approach is more portable (across Unix/Linux variants, at least).
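The separate-process variant could look something like the sketch below. This is an illustration, not the actual pretoucher under discussion: the class name, the page size, and the "offset length per line on stdin" protocol are all assumptions. The point is that the mapped-I/O page faults land in this helper JVM, so they cannot stall the main process's threads on their way to a safepoint.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Pretoucher {
    static final int PAGE_SIZE = 4096; // assumption: 4 KiB pages

    // Map the region and read one byte per page. The resulting page faults
    // (and any disk I/O behind them) happen in this process only.
    static void touch(FileChannel ch, long offset, int length) throws IOException {
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, offset, length);
        for (int pos = 0; pos < length; pos += PAGE_SIZE) {
            buf.get(pos);
        }
    }

    // Assumed protocol: the main process writes one "offset length" pair
    // per line to our stdin, telling us what to touch.
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw");
             FileChannel ch = raf.getChannel();
             BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                touch(ch, Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
            }
        }
    }
}
```

Reading (rather than writing) each page keeps the helper from clobbering data the writer may have placed there, which is the same concern that motivates the CAS in the in-process variant.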

> 
> Cheers,
> 
> Mark
> 
> On Tuesday, 5 December 2017 10:53:17 UTC, Gil Tene wrote:
> Page faults in mapped file i/o and counted loops are certainly two common 
> causes of long TTSP. But there are many other paths that *could* cause it as 
> well in HotSpot. Without catching it and looking at the stack trace, it's 
> hard to know which ones to blame. Once you knock out one cause, you'll see if 
> there is another.
> 
> 
> In the specific stack trace you showed [assuming that trace was taken during 
> a long TTSP], mapped file i/o is the most likely culprit. Your trace seems to 
> be around making the page write-able for the first time and updating the file 
> time (which takes a lock), but even without needing the lock, the fault 
> itself could end up waiting for the i/o to complete (read page from disk), 
> and that (when Murphy pays you a visit) can end up waiting behind 100s of other 
> i/o operations (e.g. when your i/o happens at the same time the kernel 
> decided to flush some dirty pages in the cache), leading to TTSPs in the 100s 
> of msec.
> 
> 
> As I'm sure you already know, one simple way to get around mapped file 
> related TTSP is to not use mapped files. Explicit random i/o calls are 
> always done while at a safepoint, so they can't cause high TTSPs.
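A hypothetical sketch of that approach, applied to the pre-touching problem from earlier in the thread: a positional FileChannel.read blocks inside a native call, where the thread already counts as being at a safepoint, so a slow disk read here cannot inflate TTSP the way a page fault in JIT-compiled code can. The class name, page size, and method shape are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ReadPretoucher {
    static final int PAGE_SIZE = 4096; // assumption: 4 KiB pages

    // Read one byte per page with explicit positional i/o. This pulls the
    // pages into the page cache without mapped-i/o faults; returns the
    // number of pages touched.
    static int pretouch(FileChannel ch, long offset, long length) throws IOException {
        ByteBuffer one = ByteBuffer.allocate(1);
        int pages = 0;
        long end = Math.min(offset + length, ch.size());
        for (long pos = offset; pos < end; pos += PAGE_SIZE) {
            one.clear();
            ch.read(one, pos); // blocks in native code, i.e. at a safepoint
            pages++;
        }
        return pages;
    }
}
```

Note that, as discussed above, this only populates the page cache: the first mapped *write* to each page still takes a minor fault for the write-enable and file-time update.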
> 
> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
> Hi Aleksey,
> thanks for the response. The I/O is definitely one problem, but I was trying 
> to figure out whether it was contributing to the long TTSP times, or whether 
> I might have some code that was misbehaving (e.g. NonCountedLoops).
> 
> Your response aligns with my guesswork, so hopefully I just have the one 
> problem to solve ;)
> 
> 
> 
> Cheers,
> 
> Mark
> 
> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
> On 12/05/2017 09:26 AM, Mark Price wrote:
> 
> > I'm investigating some long time-to-safepoint pauses in oracle/openjdk. The 
> > application in question is also suffering from some fairly nasty I/O 
> > problems where latency-sensitive threads are being descheduled in 
> > uninterruptible sleep state due to needing a file-system lock.
> > 
> > My question: can the JVM detect that a thread is in signal/interrupt-handler 
> > code and thus treat it as though it is at a safepoint (as I believe happens 
> > when a thread is in native code via a JNI call)?
> > 
> > For instance, given the stack trace below, will the JVM need to wait for 
> > the thread to be scheduled back on to CPU in order to come to a safepoint, 
> > or will it be treated as "in-native"?
> 
> >         7fff81714cd9 __schedule ([kernel.kallsyms])
> >         7fff817151e5 schedule ([kernel.kallsyms])
> >         7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms])
> >         7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms])
> >         7fff817172ad down_write ([kernel.kallsyms])
> >         7fffa0403dcf xfs_ilock ([kernel.kallsyms])
> >         7fffa04018fe xfs_vn_update_time ([kernel.kallsyms])
> >         7fff8122cc5d file_update_time ([kernel.kallsyms])
> >         7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms])
> >         7fff811ba935 do_page_mkwrite ([kernel.kallsyms])
> >         7fff811bda74 handle_pte_fault ([kernel.kallsyms])
> >         7fff811c041b handle_mm_fault ([kernel.kallsyms])
> >         7fff8106adbe __do_page_fault ([kernel.kallsyms])
> >         7fff8106b0c0 do_page_fault ([kernel.kallsyms])
> >         7fff8171af48 page_fault ([kernel.kallsyms])
> >         ---- java stack trace ends here ----
> 
> 
> 
> I am pretty sure an out-of-band page fault in a Java thread does not yield a 
> safepoint, at least because safepoint polls happen at given locations in the 
> generated code: we need the pointer map as part of the machine state, and 
> that is generated by HotSpot (only) around the safepoint polls. Page 
> faulting on random read/write insns does not have that luxury. Even if the 
> JVM had intercepted that fault, there is not enough metadata to work on.
> 
> The stack trace above seems to say you have page faulted and this incurred 
> disk I/O? This is swapping, I think, and all performance bets are off at 
> that point.
> 
> Thanks,
> -Aleksey

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
