You should also consider the CPU you are using. Westmere CPUs have far fewer (but disjoint/complementary) TLB slots for huge pages, so large sparse working sets (with good temporal locality but poor spatial locality) could incur a significant performance penalty on these CPUs when huge pages are used.
On 9 August 2017 at 14:06, Henri Tremblay <[email protected]> wrote:

> From what I see here I would deduce:
>
> 1- THP can give a huge performance gain (when using PreTouch, in some cases, possibly when not playing with offheap too much)
> 2- But it will increase hiccups
>
> A bit like the throughput collector.
>
> So my current take away is:
>
> - Use THP if you care about throughput only
> - If you care about latency, just don't
> - If you really care about throughput, use non-transparent huge pages
>
> Is that accurate?
>
> On 9 August 2017 at 04:51, Peter Veentjer <[email protected]> wrote:
>
>> Thanks for your very useful replies Gil.
>>
>> Question:
>>
>> Using huge pages can give a big performance boost:
>>
>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>
>> $ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
>> real 13m58.167s
>> user 43m37.519s
>> sys 1011m25.740s
>>
>> $ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>> real 2m14.758s
>> user 1m56.488s
>> sys 73m59.046s
>>
>> But THP seems to be unusable. Does this effectively mean that we can't benefit from THP under Linux?
>>
>> So far it looks like a damned-if-you-do, damned-if-you-don't situation.
>>
>> Or should we move to non-transparent huge pages?
>>
>> On Tuesday, August 8, 2017 at 7:44:25 PM UTC+3, Gil Tene wrote:
>>>
>>> On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
>>>>
>>>> Saw this a while back.
>>>>
>>>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>>>
>>>> Basically using THP/defrag with madvise and using -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.
>>>>
>>>> Looks like the defrag cost should be paid in full at startup due to AlwaysPreTouch. Never got around to testing this in production. Just have THP disabled. Thoughts?
>>>
>>> The above flags would only cover the Java heap in a Java application, so obviously THP for non-Java things doesn't get helped by that.
>>>
>>> And for Java stuff, unfortunately, there are lots of non-Java-heap things that are exposed to THP's potentially huge on-demand faulting latencies. The JVM manages lots of memory outside of the Java heap for various things (GC stuff, stacks, Metaspace, code cache, JIT compiler things, and a whole bunch of runtime stuff), and the application itself will often be using off-heap memory intentionally (e.g. via DirectByteBuffers) or inadvertently (e.g. when libraries make either temporary or lasting use of off-heap memory). Even simple socket I/O involves some use of off-heap memory as an intermediate storage location.
>>>
>>> As a simple demonstration of why THP artifacts on non-Java-heap memory are a key problem for Java apps: I first ran into these THP issues by experience, with Zing, right around the time that RHEL 6 turned it on. We found out the hard way that we have to turn it off to maintain reasonable latency profiles. And since Zing has always *ensured* that 2MB pages are used for everything in the heap, the code cache, and virtually all GC support structures, it is clearly the THP impact on all the rest of the stuff that has caused us to deal with it and recommend against its use.
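As a quick sanity check of how far THP reaches beyond the Java heap, you can inspect the per-mapping AnonHugePages counters for a running JVM. A minimal sketch, assuming a JVM running as pid 12345 and the /proc layout of a mainline kernel:

$ # Count the JVM's memory mappings actually backed by transparent huge pages
$ grep AnonHugePages /proc/12345/smaps | awk '$2 > 0' | wc -l
$ # Total anonymous memory backed by THP, system-wide
$ grep AnonHugePages /proc/meminfo

Any non-heap mappings (thread stacks, code cache, DirectByteBuffer memory, etc.) that show non-zero AnonHugePages values are exposed to the on-demand faulting latencies described above.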
>>> The way we see THP manifest regularly (when left on) is with occasional huge TTSPs (time to safepoint) [huge in Zing terms, meaning anything from a few msec to 100s of msec], which we know are there because we specifically log and chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP when we actually try to bring threads to safepoints, and doing that is a relatively (dynamically) rare event, whenever we see actual high TTSPs in our logs it is likely that similar-sized disruptions are occurring at the application level, but at a much higher frequency than that of the high TTSPs we observe.
>>>
>>>> - Alen
>>>>
>>>> 2017-08-07 20:14 GMT+02:00 Gil Tene <[email protected]>:
>>>>
>>>> > To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked to below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call is made and manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. The specific example stack trace is taken from a situation where that fault took so long (on a NUMA system) that a soft-lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.
>>>> >
>>>> > Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path that would be taken when no free 2MB pages are found in current kernels :-(
>>>> >
>>>> > And this situation will naturally occur under all sorts of common timing conditions (i/o fragmenting free memory to 4KB (but no 2MB), background compaction/defrag falling behind during some heavy kernel-driven i/o spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted).
>>>> >
>>>> > kernel: Call Trace:
>>>> > kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
>>>> > kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
>>>> > kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
>>>> > kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
>>>> > kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
>>>> > kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
>>>> > kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
>>>> > kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
>>>> > kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
>>>> > kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
>>>> > kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
>>>> > kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
>>>> > kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
>>>> > kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
>>>> > kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
>>>> > kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
>>>> > kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
>>>> > kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
>>>> > kernel: [<ffffffff8160b808>] page_fault+0x28/0x30
>>>> >
>>>> > On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:
>>>> >
>>>> >> THP certainly sits in my "just don't do it" list of tuning things due to its fundamental, dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC. And the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
>>>> >>
>>>> >> I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.
>>>> >>
>>>> >> IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:
>>>> >>
>>>> >> 1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).
>>>> >>
>>>> >> 2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).
>>>> >>
>>>> >> While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2 is. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.
>>>> >>
>>>> >> -- Gil.
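Whether a given system is actually hitting these allocation-time compaction stalls can be estimated from the kernel's own counters before resorting to kernel stack sampling. A rough sketch: the counter names below exist in mainline kernels of roughly this era, and the values shown are purely illustrative:

$ # thp_fault_fallback: THP faults that gave up and fell back to 4KB pages
$ # compact_stall:      times a faulting thread entered direct (synchronous) compaction
$ egrep 'thp_fault_alloc|thp_fault_fallback|compact_stall|compact_fail' /proc/vmstat
thp_fault_alloc 512341
thp_fault_fallback 1023
compact_stall 87
compact_fail 12

A compact_stall count that keeps climbing under load means application threads are paying the synchronous defrag cost Gil describes, even if no individual stall has been caught in a log yet.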
>>>> >> -------------------------------
>>>> >>
>>>> >> The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP being on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...
>>>> >>
>>>> >> Here is something I wrote up on it internally after much investigation:
>>>> >>
>>>> >> Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x; it got into the upstream kernel around the ~2.6.38 timeframe and generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common for many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time). With such a mixed approach, some sort of "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages) and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
>>>> >>
>>>> >> Defragmentation/compaction with THP can happen in two places:
>>>> >>
>>>> >> 1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.
>>>> >>
>>>> >> 2. "Synchronous Compaction": In some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen on some stack traces in related discussions.
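For reference, the knobs being discussed live in sysfs. A sketch assuming a RHEL-6-era layout; the set of accepted values (especially for defrag) varies across kernel versions:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
$ # Opt-in mode: only regions that madvise(MADV_HUGEPAGE) get THP treatment
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ # Or turn it off entirely, per the recommendation below
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

HotSpot's -XX:+UseTransparentHugePages is, as far as I know, madvise-based, which is why Shipilev's recipe pairs it with the madvise setting; note, though, that a madvised region can still hit the synchronous compaction path on fault.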
>>>> >> More details can be found in places like this:
>>>> >>
>>>> >> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
>>>> >> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>>>> >>
>>>> >> And examples of avoiding thrashing by disabling THP in RHEL 6.2 are around:
>>>> >>
>>>> >> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
>>>> >> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>>>> >>
>>>> >> BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.
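And for completeness, the non-transparent route Henri and Peter are asking about: reserve explicit huge pages up front and point the JVM at them. A minimal sketch, assuming a 4GB heap backed by 2MB pages; the exact permission setup (e.g. vm.hugetlb_shm_group) varies by distro and JVM version:

$ # Reserve 2048 x 2MB = 4GB of explicit huge pages (best done at boot, before memory fragments)
$ sudo sysctl -w vm.nr_hugepages=2048
$ grep HugePages /proc/meminfo    # verify HugePages_Total and HugePages_Free
$ java -XX:+UseLargePages -Xms4g -Xmx4g ...

Because the pages are reserved ahead of time, there is no allocation-time defragging left to stall on; the trade-off is that the reserved memory is unavailable to everything else on the box.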
