Thanks for your very useful replies, Gil.

Question:

Using huge pages can give a big performance boost:

https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/

$ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
real    13m58.167s
user    43m37.519s
sys     1011m25.740s

$ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
real    2m14.758s
user    1m56.488s
sys     73m59.046s



But THP seems to be unusable in practice. Does this effectively mean that we can't 
benefit from THP under Linux? 

So far it looks like a damned-if-you-do, damned-if-you-don't situation.

Or should we move to non-transparent (explicitly reserved) huge pages? 
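
For context, by non-transparent huge pages I mean explicitly reserved hugetlb 
pages plus -XX:+UseLargePages, roughly along these lines (a sketch only; the 
page count and heap sizes below are made-up example values, and the exact setup 
depends on the kernel and JVM version):

# reserve 2560 x 2MB = 5GB of huge pages up front (example value; size it to
# the heap plus some headroom)
$ sudo sysctl -w vm.nr_hugepages=2560

# back the Java heap with the reserved huge pages instead of relying on THP
# (depending on the JVM version you may also need to sort out hugetlb group
# permissions and memlock ulimits)
$ java -Xms4g -Xmx4g -XX:+UseLargePages -XX:+AlwaysPreTouch ...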

On Tuesday, August 8, 2017 at 7:44:25 PM UTC+3, Gil Tene wrote:
>
>
>
> On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
>>
>> Saw this a while back. 
>>
>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/ 
>>
>> Basically using THP/defrag with madvise and using 
>> -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts. 
>>
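>> (By "THP/defrag with madvise" I mean roughly the following; the sysfs paths 
>> are the standard THP ones from the kernel docs, and the exact option names 
>> can vary by kernel version:) 
>>
>> # only use THP (and fault-time defrag) for ranges the application marks 
>> # with madvise(MADV_HUGEPAGE) 
>> $ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled 
>> $ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag 
>>
>> # -XX:+UseTransparentHugePages makes the JVM madvise the Java heap, and 
>> # -XX:+AlwaysPreTouch touches every heap page up front so the faulting/defrag 
>> # cost is paid at startup rather than at run time 
>> $ java -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch ... 
>>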
>> Looks like the defrag cost should be paid in full at startup due to 
>> AlwaysPreTouch. Never got around to testing this in production. Just have 
>> THP disabled. Thoughts? 
>>
>
> The above flags would only cover the Java heap in a Java application, so 
> obviously THP for non-Java things doesn't get helped by that.
>
> And for Java stuff, unfortunately, there are lots of non-Java-heap things 
> that are exposed to THP's potentially huge on-demand faulting latencies. 
> The JVM manages lots of memory outside of the Java heap for various things 
> (GC stuff, stacks, Metaspace, the code cache, JIT compiler things, and a 
> whole bunch of runtime stuff), and the application itself will often be 
> using off-heap memory intentionally (e.g. via DirectByteBuffers) or 
> inadvertently (e.g. when libraries make either temporary or lasting use of 
> off-heap memory). E.g. even simple socket I/O involves some use of off-heap 
> memory as an intermediate storage location.
>
> As a simple demonstration of why THP artifacts for non-Java-heaps are a 
> key problem for Java apps: I first ran into these THP issues first-hand, 
> with Zing, right around the time that RHEL 6 turned it on. We found out the 
> hard way that we have to turn it off to maintain reasonable latency 
> profiles. And since Zing has always *ensured* 2MB pages are used for 
> everything in the heap, the code cache, and for virtually all GC support 
> structures, it is clearly the THP impact on all the rest of the stuff that 
> has caused us to deal with it and recommend against its use. The way we 
> see THP manifest regularly (when left on) is with occasional huge TTSPs 
> (time to safepoint) [huge in Zing terms, meaning anything from a few msec 
> to 100s of msec], which we know are there because we specifically log and 
> chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP 
> when we actually try to bring threads to safepoints, and doing that is a 
> relatively (dynamically) rare event, whenever we see actual high TTSPs in 
> our logs it is likely that similar-sized disruptions are occurring at the 
> application level, but at a much higher frequency than that of the high 
> TTSPs we observe.
>
>
>> - Alen 
>>
>> 2017-08-07 20:14 GMT+02:00 Gil Tene <g...@azul.com>: 
>> > To highlight the problematic path in current THP kernel implementations, 
>> > here is an example call trace that can happen (pulled from the discussion 
>> > linked to below). It shows that a simple on-demand page fault in regular 
>> > anonymous memory (e.g. when a normal malloc call is made and manipulates 
>> > a malloc-managed 2MB area, or when the resulting malloc'ed struct is 
>> > written to) can end up compacting an entire zone (which can be the vast 
>> > majority of system memory) in a single call, using the faulting thread. 
>> > The specific example stack trace is taken from a situation where that 
>> > fault took so long (on a NUMA system) that a soft lockup was triggered, 
>> > showing the call took longer than 22 seconds (!!!). But even without the 
>> > NUMA or migrate_pages aspects, compaction of a single zone can take 100s 
>> > of msec or more. 
>> > 
>> > Browsing through the current kernel code (e.g. 
>> > http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) 
>> > seems to show that this is still the likely path taken when no free 2MB 
>> > pages are found in current kernels :-( 
>> > 
>> > And this situation will naturally occur under all sorts of common timing 
>> > conditions (I/O fragmenting free memory into 4KB pages (but no 2MB ones), 
>> > background compaction/defrag falling behind during some heavy 
>> > kernel-driven I/O spike, and some unlucky thread doing a malloc when the 
>> > 2MB physical free list is exhausted). 
>> > 
>> > 
>> > kernel: Call Trace: 
>> > kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240 
>> > kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610 
>> > kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380 
>> > kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400 
>> > kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0 
>> > kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0 
>> > kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196 
>> > kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90 
>> > kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0 
>> > kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140 
>> > kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410 
>> > kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60 
>> > kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520 
>> > kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0 
>> > kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40 
>> > kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300 
>> > kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70 
>> > kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0 
>> > kernel: [<ffffffff8160b808>] page_fault+0x28/0x30 
>> > 
>> > 
>> > On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote: 
>> >> 
>> >> THP certainly sits in my "just don't do it" list of tuning things due to 
>> >> its fundamental, dramatic latency disruption in current implementations, 
>> >> seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls 
>> >> on something as simple and common as a 32 byte malloc. THP is a form of 
>> >> in-kernel GC. And the current THP implementation involves potential and 
>> >> occasional synchronous, stop-the-world compaction done at 
>> >> allocation-time, on or by any application thread that does an mmap or a 
>> >> malloc. 
>> >> 
>> >> I dug up an e-mail I wrote on the subject (to a recipient on this list) 
>> >> back in Jan 2013 [see below]. While it has some specific links (including 
>> >> a stack trace showing the kernel de-fragging the whole system on a single 
>> >> mmap call), note that this material is now 4.5 years old, and things 
>> >> *might* have changed or improved to some degree. While I've seen no 
>> >> recent first-hand evidence of efforts to improve things on the 
>> >> don't-dramatically-stall-malloc (or other mappings) front, I haven't been 
>> >> following it very closely (I just wrote it off as "let's check again in 5 
>> >> years"). If someone else here knows of some actual improvements to this 
>> >> picture in recent years, or of efforts or discussions in the Linux kernel 
>> >> community on this subject, please point to them. 
>> >> 
>> >> IMO, the notion of THP is not flawed. The implementation is. And I 
>> >> believe that the implementation of THP *can* be improved to be much more 
>> >> robust and to avoid forcing occasional huge latency artifacts on 
>> >> memory-allocating threads: 
>> >> 
>> >> 1. The first (huge) step in improving things would be to 
>> >> never-ever-ever-ever have a mapping thread spend any time performing any 
>> >> kind of defragmentation, and to simply accept 4KB mappings when no 2MB 
>> >> physical pages are available. Let background defragmentation do all the 
>> >> work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB 
>> >> mappings). 
>> >> 
>> >> 2. The second level (much needed, but at an order of magnitude of 10s of 
>> >> milliseconds rather than the current 100s of msec or more) would be to 
>> >> make background defragmentation work without stalling foreground access 
>> >> to a currently-being-defragmented 2MB region. I.e. don't stall access for 
>> >> the duration of a 2MB defrag operation (which can take several msec). 
>> >> 
>> >> While both of these are needed for a "don't worry about it" mode of use 
>> >> (which something called "transparent" really should aim for), #1 is a 
>> >> much easier step than #2 is. Without it, THP can cause application pauses 
>> >> (in any Linux app) that are often worse than e.g. HotSpot Java GC pauses. 
>> >> Which is ironic. 
>> >> 
>> >> -- Gil. 
>> >> 
>> >> ------------------------------- 
>> >> 
>> >> The problem is not the background defrag operation. The problem is 
>> >> synchronous defragging done on allocation, where THP being on means a 
>> >> 2MB allocation will attempt to use a 2MB contiguous physical page, and 
>> >> if it can't find one, it may end up defragging an entire zone before the 
>> >> allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag 
>> >> setting only controls the background... 
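>> >> 
>> >> (For reference, both of the relevant sysfs settings can be inspected 
>> >> directly; the bracketed value in the output is the active one, and the 
>> >> exact set of options varies by kernel version:) 
>> >> 
>> >> $ cat /sys/kernel/mm/transparent_hugepage/enabled 
>> >> $ cat /sys/kernel/mm/transparent_hugepage/defrag 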
>> >> 
>> >> Here is something I wrote up on it internally after much investigation: 
>> >> 
>> >> Transparent huge pages (THP) is a feature Red Hat championed and 
>> >> introduced in RHEL 6.x. It got into the upstream kernel around the 
>> >> ~2.6.38 time frame and generally exists in all Linux 3.x kernels and 
>> >> beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With 
>> >> transparent huge pages, the kernel *attempts* to use 2MB page mappings 
>> >> to map contiguous and aligned memory ranges of that size (which are 
>> >> quite common for many program scenarios), but will break those into 4KB 
>> >> mappings when needed (e.g. when it cannot satisfy them with 2MB pages, 
>> >> or when it needs to swap or page out the memory, since paging is done 
>> >> 4KB at a time). With such a mixed approach, some sort of a 
>> >> "defragmenter" or "compactor" is required to exist, because without it 
>> >> simple fragmentation will (over time) make 2MB contiguous physical pages 
>> >> a rare thing, and performance will tend to degrade over time. As a 
>> >> result, and in order to support THP, Linux kernels will attempt to 
>> >> defragment (or "compact") memory and memory zones. This can be done 
>> >> either by unmapping pages, copying their contents to a new compacted 
>> >> space, and mapping them in the new location, or by potentially forcing 
>> >> individually mapped 4KB pages in a 2MB physical page out (via swapping, 
>> >> or by paging them out if they are file system pages), and reclaiming the 
>> >> 2MB contiguous page when that is done. 4KB pages that were forced out 
>> >> will come back in as needed (swapped back in on demand, or paged back in 
>> >> on demand). 
>> >> 
>> >> Defragmentation/compaction with THP can happen in two places (some 
>> >> counters for observing both are shown after point 2 below): 
>> >> 
>> >> 1. First, there is a background defragmenter (a process called 
>> >> "khugepaged") that goes around and compacts 2MB physical pages by 
>> >> pushing their 4KB pages out when possible. This background defragger 
>> >> could potentially cause pages to be swapped out if swapping is enabled, 
>> >> even with no swapping pressure in place. 
>> >> 
>> >> 2. "Synchronous Compaction": In some cases, an on demand page fault 
>> (e.g. 
>> >> when first accessing a newly alloocated 4KB page created via mmap() or 
>> >> malloc()) could end up trying to compact memory in order to fault into 
>> a 2MB 
>> >> physical page instead of a 4KB page (this can be seen in the stack 
>> trace 
>> >> discussed in this posting, for example: 
>> >> https://access.redhat.com/solutions/1560893). When this happens, a 
>> single 
>> >> 4KB allocation could end up waiting for an attempt to compact an 
>> entire 
>> >> "zone" of pages, even if those are compacted purely thru in-memory 
>> moves 
>> >> with no I/O. It can also be blocked waiting for disk I/O as seen on 
>> some 
>> >> stack traces in related discussions. 
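>> >> 
>> >> (For reference, whether either of these is happening on a given box can 
>> >> be observed via the standard kernel counters; the names below are from 
>> >> /proc/vmstat and may differ slightly between kernel versions:) 
>> >> 
>> >> # THP fault-time behavior and khugepaged collapse activity 
>> >> $ grep -E 'thp_fault_alloc|thp_fault_fallback|thp_collapse_alloc' /proc/vmstat 
>> >> 
>> >> # direct (synchronous) compaction stalls taken by faulting threads 
>> >> $ grep -E 'compact_stall|compact_fail|compact_success' /proc/vmstat 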
>> >> 
>> >> More details can be found in places like this: 
>> >> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt 
>> >> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf 
>> >> 
>> >> And examples of avoiding thrashing by disabling THP on RHEL 6.2 are 
>> >> around: 
>> >> 
>> >> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads 
>> >> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html 
>> >> 
>> >> BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps 
>> >> compact physical memory and use more optimal mappings in the kernel, but 
>> >> it can come with some significant (and often surprising) latency 
>> >> impacts. I recommend we turn it off by default in Zing installations, 
>> >> and it appears that many other software packages (including most DBs, 
>> >> and many Java-based apps) recommend the same. 
>> > 
>
