Great, plenty of things to try :)

Thanks for your input everyone.
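
For reference, here is a minimal sketch of the non-mapped read pre-touch discussed below. It only illustrates the idea and is not the code from the application: the class name PreToucher, the method preTouch and the 4096-byte page size are assumptions made for the example. It uses positional FileChannel.read() calls, which run as explicit i/o rather than as page faults on a mapping. As noted in the quoted reply, this should pull pages into the page cache, but it will probably not avoid the file-time update (and its lock) when a mapped page is first written.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PreToucher {
    // Assumed page size; a real system may use a different value.
    static final int PAGE_SIZE = 4096;

    /**
     * Reads one byte from each page overlapping [from, to) using positional,
     * non-mapped reads. Returns the number of pages from which a byte was read
     * (pages beyond the end of the file are not counted).
     */
    static int preTouch(FileChannel channel, long from, long to) throws IOException {
        ByteBuffer oneByte = ByteBuffer.allocateDirect(1);
        int touched = 0;
        // Start at the beginning of the page that contains 'from'.
        for (long pos = from - (from % PAGE_SIZE); pos < to; pos += PAGE_SIZE) {
            oneByte.clear();
            // Positional read: does not change the channel's own position.
            if (channel.read(oneByte, pos) > 0) {
                touched++;
            }
        }
        return touched;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pretouch", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Extend the file to exactly four pages by writing its last byte.
            ch.write(ByteBuffer.allocate(1), 4L * PAGE_SIZE - 1);
            System.out.println(preTouch(ch, 0, 4L * PAGE_SIZE));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In practice the pre-toucher would be called with a range just ahead of the current write limit, from a thread other than the latency-sensitive writer.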




On Tuesday, 5 December 2017 12:39:22 UTC, Gil Tene wrote:
>
>
>
> Sent from my iPad
>
> On Dec 5, 2017, at 1:26 PM, Mark Price <[email protected]> wrote:
>
>
>> That (each process having its own copy) is surprising to me. Unless the 
>> mapping is such that private copies are required, I'd expect the processes 
>> to share the page cache entries.
>>
>
> I can't recreate this effect locally using FileChannel.map(); the library 
> in use in the application uses a slightly more exotic route to get to mmap, 
> so it could be a bug there; will investigate. I could also have been 
> imagining it.
>  
>
>>  
>>
>> Is your pre-toucher thread a Java thread doing its pre-touching using 
>> mapped i/o in the same process? If so, then the pre-toucher thread itself 
>> will be a high TTSP causer. The trick is to do the pre-touch in a thread 
>> that is already at a safepoint (e.g. do your pre-touch using mapped i/o 
>> from within a JNI call, use another process, or do the pre-touch with 
>> non-mapped i/o).
>>
>
> Yes, just a java thread in the same process; I hadn't considered that it 
> would also cause long TTSP, but of course it's just as likely (or more 
> likely) to be scheduled off due to a page fault. I could try using pwrite 
> via FileChannel.write() to do the pre-touching, but I think it needs to 
> perform a CAS (i.e. don't overwrite data that is already present), so a JNI 
> method would be the only way to go. Unless just doing a 
> FileChannel.position(writeLimit).read(buffer) would do the job? Presumably 
> that is enough to load the page into the cache and performing a write is 
> unnecessary.
>
>
> This (non-mapped reading at the write limit) will work to eliminate the 
> actual page I/O impact on TTSP, but the time update path with the lock that 
> you show in your initial stack trace will probably still hit you. I’d go 
> either with a JNI CAS, or a forked-off mapped Java pretoucher as a separate 
> process (tell it what you want touched via its stdin). Not sure which one is 
> uglier. The pure Java option is more portable (for Unix/Linux variants at least).
>
>  
>
>>  
>>
>>>
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>> On Tuesday, 5 December 2017 10:53:17 UTC, Gil Tene wrote: 
>>>>
>>>> Page faults in mapped file i/o and counted loops are certainly two 
>>>> common causes of long TTSP. But there are many other paths that *could* 
>>>> cause it as well in HotSpot. Without catching it and looking at the stack 
>>>> trace, it's hard to know which ones to blame. Once you knock out one 
>>>> cause, 
>>>> you'll see if there is another. 
>>>>
>>>> In the specific stack trace you showed [assuming that trace was taken 
>>>> during a long TTSP], mapped file i/o is the most likely culprit. Your 
>>>> trace 
>>>> seems to be around making the page write-able for the first time and 
>>>> updating the file time (which takes a lock), but even without needing the 
>>>> lock, the fault itself could end up waiting for the i/o to complete (read 
>>>> page from disk), and that (when Murphy pays you a visit) can end up 
>>>> waiting 
>>>> behind 100s of other i/o operations (e.g. when your i/o happens at the same 
>>>> time the kernel decided to flush some dirty pages in the cache), leading 
>>>> to 
>>>> TTSPs in the 100s of msec.
>>>>
>>>> As I'm sure you already know, one simple way to get around mapped file 
>>>> related TTSP is to not use mapped files. Explicit random i/o calls are 
>>>> always done while at a safepoint, so they can't cause high TTSPs.
>>>>
>>>> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote: 
>>>>>
>>>>> Hi Aleksey,
>>>>> thanks for the response. The I/O is definitely one problem, but I was 
>>>>> trying to figure out whether it was contributing to the long TTSP times, 
>>>>> or 
>>>>> whether I might have some code that was misbehaving (e.g. 
>>>>> NonCountedLoops).
>>>>>
>>>>> Your response aligns with my guesswork, so hopefully I just have the 
>>>>> one problem to solve ;)
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mark
>>>>>
>>>>> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote: 
>>>>>>
>>>>>> On 12/05/2017 09:26 AM, Mark Price wrote: 
>>>>>> > I'm investigating some long time-to-safepoint pauses in 
>>>>>> oracle/openjdk. The application in question 
>>>>>> > is also suffering from some fairly nasty I/O problems where 
>>>>>> latency-sensitive threads are being 
>>>>>> > descheduled in uninterruptible sleep state due to needing a 
>>>>>> file-system lock. 
>>>>>> > 
>>>>>> > My question: can the JVM detect that a thread is in 
>>>>>> signal/interrupt-handler code and thus treat it 
>>>>>> > as though it is at a safepoint (as I believe happens when a thread 
>>>>>> is in native code via a JNI call)? 
>>>>>> > 
>>>>>> > For instance, given the stack trace below, will the JVM need to 
>>>>>> wait for the thread to be scheduled 
>>>>>> > back on to CPU in order to come to a safepoint, or will it be 
>>>>>> treated as "in-native"? 
>>>>>> > 
>>>>>> >         7fff81714cd9 __schedule ([kernel.kallsyms]) 
>>>>>> >         7fff817151e5 schedule ([kernel.kallsyms]) 
>>>>>> >         7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms]) 
>>>>>> >         7fff813556e7 call_rwsem_down_write_failed 
>>>>>> ([kernel.kallsyms]) 
>>>>>> >         7fff817172ad down_write ([kernel.kallsyms]) 
>>>>>> >         7fffa0403dcf xfs_ilock ([kernel.kallsyms]) 
>>>>>> >         7fffa04018fe xfs_vn_update_time ([kernel.kallsyms]) 
>>>>>> >         7fff8122cc5d file_update_time ([kernel.kallsyms]) 
>>>>>> >         7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms]) 
>>>>>> >         7fff811ba935 do_page_mkwrite ([kernel.kallsyms]) 
>>>>>> >         7fff811bda74 handle_pte_fault ([kernel.kallsyms]) 
>>>>>> >         7fff811c041b handle_mm_fault ([kernel.kallsyms]) 
>>>>>> >         7fff8106adbe __do_page_fault ([kernel.kallsyms]) 
>>>>>> >         7fff8106b0c0 do_page_fault ([kernel.kallsyms]) 
>>>>>> >         7fff8171af48 page_fault ([kernel.kallsyms]) 
>>>>>> >         ---- java stack trace ends here ---- 
>>>>>>
>>>>>> I am pretty sure an out-of-band page fault in a Java thread does not 
>>>>>> yield a safepoint. At least because 
>>>>>> safepoint polls happen at given locations in the generated code, 
>>>>>> because we need the pointer map as 
>>>>>> part of the machine state, and that is generated by HotSpot 
>>>>>> (only) around the safepoint polls. 
>>>>>> Page faulting on random read/write insns does not have that luxury. 
>>>>>> Even if the JVM had intercepted that 
>>>>>> fault, there is not enough metadata to work on. 
>>>>>>
>>>>>> The stacktrace above seems to say you have page faulted and this 
>>>>>> incurred disk I/O? This is 
>>>>>> swapping, I think, and all performance bets are off at that point. 
>>>>>>
>>>>>> Thanks, 
>>>>>> -Aleksey 
>>>>>>
>>>>>> -- 
> You received this message because you are subscribed to a topic in the 
> Google Groups "mechanical-sympathy" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/mechanical-sympathy/tepoA7PRFRU/unsubscribe
> .
> To unsubscribe from this group and all its topics, send an email to 
> [email protected] <javascript:>.
> For more options, visit https://groups.google.com/d/optout.
>
>
