You should also consider the CPU you are using. Westmere CPUs have far fewer (but disjoint/complementary) TLB slots for huge pages, so large sparse working sets (with good temporal locality but poor spatial locality) could incur a significant performance penalty on these CPUs when huge pages are used.
On 9 August 2017 at 14:06, Henri Tremblay <[email protected]> wrote:

> From what I see here I would deduce:
>
> 1- THP can give a huge performance gain (when using PreTouch, in some cases, possibly when not playing with offheap too much)
> 2- But it will increase hiccups
>
> A bit like the throughput collector.
>
> So my current take away is:
>
> - Use THP if you care about throughput only
> - If you care about latency, just don't
> - If you really care about throughput, use non-transparent huge pages
>
> Is that accurate?
>
> On 9 August 2017 at 04:51, Peter Veentjer <[email protected]> wrote:
>
>> Thanks for your very useful replies Gil.
>>
>> Question:
>>
>> Using huge pages can give a big performance boost:
>>
>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>
>> $ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
>> real 13m58.167s
>> user 43m37.519s
>> sys 1011m25.740s
>>
>> $ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>> real 2m14.758s
>> user 1m56.488s
>> sys 73m59.046s
>>
>> But THP seems to be unusable. Does this effectively mean that we can't benefit from THP under Linux?
>>
>> So far it looks like a damned-if-you-do, damned-if-you-don't situation.
>>
>> Or should we move to non-transparent huge pages?
>>
>> On Tuesday, August 8, 2017 at 7:44:25 PM UTC+3, Gil Tene wrote:
>>>
>>> On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
>>>>
>>>> Saw this a while back.
>>>>
>>>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>>>
>>>> Basically using THP/defrag with madvise and using -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.
>>>>
>>>> Looks like the defrag cost should be paid in full at startup due to AlwaysPreTouch. Never got around to testing this in production. Just have THP disabled. Thoughts?
>>>
>>> The above flags would only cover the Java heap in a Java application, so obviously THP for non-Java things doesn't get helped by that.
>>>
>>> And for Java stuff, unfortunately, there are lots of non-Java-heap things that are exposed to THP's potentially huge on-demand faulting latencies. The JVM manages lots of memory outside of the Java heap for various things (GC stuff, stacks, Metaspace, code cache, JIT compiler things, and a whole bunch of runtime stuff), and the application itself will often be using off-heap memory intentionally (e.g. via DirectByteBuffers) or inadvertently (e.g. when libraries make either temporary or lasting use of off-heap memory). Even simple socket I/O involves some use of off-heap memory as an intermediate storage location.
>>>
>>> As a simple demonstration of why THP artifacts on non-Java-heap memory are a key problem for Java apps: I first ran into these THP issues by experience, with Zing, right around the time that RHEL 6 turned it on. We found out the hard way that we have to turn it off to maintain reasonable latency profiles. And since Zing has always *ensured* that 2MB pages are used for everything in the heap, the code cache, and virtually all GC support structures, it is clearly the THP impact on all the rest of the stuff that has caused us to deal with it and recommend against its use.
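As a quick sanity check of how far THP reaches beyond the Java heap, you can inspect the per-mapping AnonHugePages counters for a running JVM. A minimal sketch, assuming a JVM running as pid 12345 and the /proc layout of a mainline kernel:

$ # Count the JVM's memory mappings actually backed by transparent huge pages
$ grep AnonHugePages /proc/12345/smaps | awk '$2 > 0' | wc -l
$ # Total anonymous memory backed by THP, system-wide
$ grep AnonHugePages /proc/meminfo

Any non-heap mappings (thread stacks, code cache, DirectByteBuffer memory, etc.) that show non-zero AnonHugePages values are exposed to the on-demand faulting latencies described above.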
>>> The way we see THP manifest regularly (when left on) is with occasional huge TTSPs (time to safepoint) [huge in Zing terms, meaning anything from a few msec to 100s of msec], which we know are there because we specifically log and chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP when we actually try to bring threads to safepoints, and doing that is a relatively (dynamically) rare event, whenever we see actual high TTSPs in our logs it is likely that similar-sized disruptions are occurring at the application level, but at a much higher frequency than that of the high TTSPs we observe.
>>>
>>>> - Alen
>>>>
>>>> 2017-08-07 20:14 GMT+02:00 Gil Tene <[email protected]>:
>>>>
>>>> > To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked to below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call is made and manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. The specific example stack trace is taken from a situation where that fault took so long (on a NUMA system) that a soft-lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.
>>>> >
>>>> > Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path that would be taken when no free 2MB pages are found in current kernels :-(
>>>> >
>>>> > And this situation will naturally occur under all sorts of common timing conditions (i/o fragmenting free memory to 4KB (but no 2MB), background compaction/defrag falling behind during some heavy kernel-driven i/o spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted).
>>>> >
>>>> > kernel: Call Trace:
>>>> > kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
>>>> > kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
>>>> > kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
>>>> > kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
>>>> > kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
>>>> > kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
>>>> > kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
>>>> > kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
>>>> > kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
>>>> > kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
>>>> > kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
>>>> > kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
>>>> > kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
>>>> > kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
>>>> > kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
>>>> > kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
>>>> > kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
>>>> > kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
>>>> > kernel: [<ffffffff8160b808>] page_fault+0x28/0x30
>>>> >
>>>> > On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:
>>>> >
>>>> >> THP certainly sits in my "just don't do it" list of tuning things due to its fundamental, dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC. And the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
>>>> >>
>>>> >> I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.
>>>> >>
>>>> >> IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:
>>>> >>
>>>> >> 1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).
>>>> >>
>>>> >> 2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).
>>>> >>
>>>> >> While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2 is. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.
>>>> >>
>>>> >> -- Gil.
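Whether a given system is actually hitting these allocation-time compaction stalls can be estimated from the kernel's own counters before resorting to kernel stack sampling. A rough sketch: the counter names below exist in mainline kernels of roughly this era, and the values shown are purely illustrative:

$ # thp_fault_fallback: THP faults that gave up and fell back to 4KB pages
$ # compact_stall:      times a faulting thread entered direct (synchronous) compaction
$ egrep 'thp_fault_alloc|thp_fault_fallback|compact_stall|compact_fail' /proc/vmstat
thp_fault_alloc 512341
thp_fault_fallback 1023
compact_stall 87
compact_fail 12

A compact_stall count that keeps climbing under load means application threads are paying the synchronous defrag cost Gil describes, even if no individual stall has been caught in a log yet.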
>>>> >> -------------------------------
>>>> >>
>>>> >> The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP being on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...
>>>> >>
>>>> >> Here is something I wrote up on it internally after much investigation:
>>>> >>
>>>> >> Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x; it got into the upstream kernel around the ~2.6.38 timeframe and generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common for many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time). With such a mixed approach, some sort of "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages) and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
>>>> >>
>>>> >> Defragmentation/compaction with THP can happen in two places:
>>>> >>
>>>> >> 1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.
>>>> >>
>>>> >> 2. "Synchronous Compaction": In some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen on some stack traces in related discussions.
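For reference, the knobs being discussed live in sysfs. A sketch assuming a RHEL-6-era layout; the set of accepted values (especially for defrag) varies across kernel versions:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
$ # Opt-in mode: only regions that madvise(MADV_HUGEPAGE) get THP treatment
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ # Or turn it off entirely, per the recommendation below
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

HotSpot's -XX:+UseTransparentHugePages is, as far as I know, madvise-based, which is why Shipilev's recipe pairs it with the madvise setting; note, though, that a madvised region can still hit the synchronous compaction path on fault.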
>>>> >> More details can be found in places like this:
>>>> >>
>>>> >> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
>>>> >> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>>>> >>
>>>> >> And examples of avoiding thrashing by disabling THP in RHEL 6.2 are around:
>>>> >>
>>>> >> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
>>>> >> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>>>> >>
>>>> >> BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.
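And for completeness, the non-transparent route Henri and Peter are asking about: reserve explicit huge pages up front and point the JVM at them. A minimal sketch, assuming a 4GB heap backed by 2MB pages; the exact permission setup (e.g. vm.hugetlb_shm_group) varies by distro and JVM version:

$ # Reserve 2048 x 2MB = 4GB of explicit huge pages (best done at boot, before memory fragments)
$ sudo sysctl -w vm.nr_hugepages=2048
$ grep HugePages /proc/meminfo    # verify HugePages_Total and HugePages_Free
$ java -XX:+UseLargePages -Xms4g -Xmx4g ...

Because the pages are reserved ahead of time, there is no allocation-time defragging left to stall on; the trade-off is that the reserved memory is unavailable to everything else on the box.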
