THP certainly sits in my "just don't do it" list of tuning things due to 
its fundamental, dramatic latency disruption in current implementations, 
seen as occasional 10s-to-100s-of-msec (and sometimes even 1sec+) stalls 
on something as simple and common as a 32 byte malloc. THP is a form of 
in-kernel GC, and the current THP implementation involves potential, 
occasional synchronous, stop-the-world compaction done at allocation 
time, on or by any application thread that does an mmap or a malloc.
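As a concrete illustration: a process that wants to opt itself out of THP 
(rather than tuning the whole machine) can do so via 
prctl(PR_SET_THP_DISABLE) (3.15+ kernels) or madvise(MADV_NOHUGEPAGE) on 
specific ranges. A minimal sketch, assuming a glibc/kernel combination new 
enough to expose both:

/* Per-process / per-range THP opt-out sketch. Assumes Linux 3.15+ for
 * PR_SET_THP_DISABLE; MADV_NOHUGEPAGE has been around since 2.6.38. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
    /* Disable THP for all future mappings of this process. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_THP_DISABLE)");

    /* Alternatively, opt out a specific range only. */
    size_t len = 64 * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
        perror("madvise(MADV_NOHUGEPAGE)");

    memset(p, 0, len);   /* faults now take the plain 4KB path */
    munmap(p, len);
    return 0;
}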

I dug up an e-mail I wrote on the subject (to a recipient on this list) 
back in Jan 2013. While it has some specific links (including a stack trace 
showing the kernel de-fragging the whole system on a single mmap call), 
note that this material is now 4.5 years old, and things *might* have 
changed or improved to some degree. While I've seen no recent first-hand 
evidence of efforts to improve things on the 
don't-dramatically-stall-malloc (or other mappings) front, I haven't been 
following it very closely (I just wrote it off as "let's check again in 5 
years"). If someone else here knows of some actual improvements to this 
picture in recent years, or of efforts or discussions in the Linux Kernel 
community on this subject, please point to them.

IMO, the notion of THP is not flawed. The implementation is. And I believe 
that the implementation of THP *can* be improved to be much more robust and 
to avoid forcing occasional huge latency artifacts on memory-allocating 
threads:

1. The first (huge) step in improving things would be to 
never-ever-ever-ever have a mapping thread spend any time performing any 
kind of defragmentation, and to simply accept 4KB mappings when no 2MB 
physical pages are available. Let background defragmentation do all the 
work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB 
mappings).

2. The second level (much needed, but at an order of magnitude of 10s of 
milliseconds rather than the current 100s of msec or more) would be to make 
background defragmentation work without stalling foreground access to a 
currently-being-defragmented 2MB region. I.e. don't stall access for the 
duration of a 2MB defrag operation (which can take several msec).

While both of these are needed for a "don't worry about it" mode of use 
(which something called "transparent" really should aim for), #1 is a much 
easier step than #2 is. Without it, THP can cause application pauses (to 
any Linux app) that are often worse than e.g. HotSpot Java GC pauses. 
Which is ironic.

-- Gil. 

-------------------------------

The problem is not the background defrag operation. The problem is 
synchronous defragging done on allocation: with THP on, a 2MB allocation 
will attempt to use a 2MB contiguous physical page, and if it can't find 
one, it may end up defragging an entire zone before the allocation 
completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only 
controls the background...
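One rough way to observe this (a sketch, assuming nothing beyond a stock 
Linux toolchain) is to time the first touch of each 2MB chunk of a freshly 
mmap'd anonymous region; on a fragmented machine with defrag-on-fault 
enabled, the outliers here are exactly the allocation-time stalls being 
described:

/* Time first-touch faults across a fresh anonymous mapping and report
 * any that take longer than 1ms. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    const size_t chunk = 2 * 1024 * 1024;     /* THP granule */
    const size_t len   = 256 * chunk;         /* 512MB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t off = 0; off < len; off += chunk) {
        uint64_t t0 = now_ns();
        p[off] = 1;                           /* first-touch fault */
        uint64_t dt = now_ns() - t0;
        if (dt > 1000000)                     /* report faults > 1ms */
            printf("fault at %zu MB took %.2f ms\n",
                   off >> 20, dt / 1e6);
    }
    munmap(p, len);
    return 0;
}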

Here is something I wrote up on it internally after much investigation:

Transparent huge pages (THP) is a feature Red Hat championed and 
introduced in RHEL 6.x; it got into the upstream kernel around ~2.6.38 and 
generally exists in all Linux 3.x kernels and beyond (so it exists in both 
SLES 11 SP2 and Ubuntu 12.04 LTS). With transparent huge pages, the kernel 
*attempts* to use 2MB page mappings to map contiguous and aligned memory 
ranges of that size (which are quite common in many program scenarios), 
but will break those into 4KB mappings when needed (e.g. when it cannot 
satisfy the allocation with 2MB pages, or when it needs to swap or page 
out the memory, since paging is done 4KB at a time). With such a mixed 
approach, some sort of "defragmenter" or "compactor" is required, because 
without one simple fragmentation will (over time) make 2MB contiguous 
physical pages a rare thing, and performance will tend to degrade. As a 
result, and in order to support THP, Linux kernels will attempt to 
defragment (or "compact") memory and memory zones. This can be done either 
by unmapping pages, copying their contents to a new compacted space, and 
mapping them in the new location, or by forcing individually mapped 4KB 
pages in a 2MB physical page out (via swapping, or by paging them out if 
they are file system pages), and reclaiming the 2MB contiguous page when 
that is done. 4KB pages that were forced out will come back in as needed 
(swapped back in on demand, or paged back in on demand).
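To check whether a given range actually ended up backed by 2MB pages or 
fell back to 4KB mappings, something like the following sketch works (it 
assumes /proc/self/smaps_rollup, present in newer kernels; on older ones, 
grep AnonHugePages in /proc/self/smaps instead):

/* Hint a 64MB anonymous mapping as huge-page eligible, touch it, then
 * print the AnonHugePages totals so you can see whether the kernel used
 * 2MB pages or fell back to 4KB mappings. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask for THP on this range (matters when enabled=madvise). */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(p, 1, len);   /* fault the pages in */

    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strstr(line, "AnonHugePages"))
            fputs(line, stdout);   /* kB actually mapped with 2MB pages */
    fclose(f);
    munmap(p, len);
    return 0;
}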

Defragmentation/compaction with THP can happen in two places:

1. First, there is a background defragmenter (a process called 
"khugepaged") that goes around and compacts 2MB physical pages by pushing 
their 4KB pages out when possible. This background defragger could 
potentially cause pages to be swapped out if swapping is enabled, even 
with no swap pressure in place.

2. "Synchronous Compaction": In some cases, an on demand page fault (e.g. 
when first accessing a newly alloocated 4KB page created via mmap() or 
malloc()) could end up trying to compact memory in order to fault into a 
2MB physical page instead of a 4KB page (this can be seen in the stack 
trace discussed in this posting, for example: 
http://us.generation-nt.com/answer/patch-mm-thp-disable-defrag-page-faults-per-default-help-204214761.html).
 
When this happens, a single 4KB allocation could end up waiting for an 
attempt to compact an entire "zone" of pages, even if those are compacted 
purely thru in-memory moves with no I/O. It can also be blocked waiting for 
disk I/O as seen on some stack traces in related discussions.
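A sketch for watching which of the two paths is being exercised, via the 
THP and compaction counters in /proc/vmstat (counter names vary a bit 
across kernel versions, so treat the list as illustrative): 
thp_fault_alloc / thp_fault_fallback count page faults that got (or failed 
to get) a 2MB page, and compact_stall counts entries into direct, 
synchronous compaction:

/* Dump the THP/compaction counters of interest from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *keys[] = { "thp_fault_alloc", "thp_fault_fallback",
                           "thp_collapse_alloc", "compact_stall",
                           "compact_success", "compact_fail" };
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("fopen /proc/vmstat"); return 1; }
    char line[128];
    while (fgets(line, sizeof line, f))
        for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
            if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                fputs(line, stdout);
    fclose(f);
    return 0;
}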

More details can be found in places like this:
http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

And examples of avoiding thrashing by disabling THP on RHEL 6.2 are 
around:

http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads

http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html

*BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps 
compact physical memory and use more optimal mappings in the kernel, but 
it can come with some significant (and often surprising) latency impacts. 
I recommend we turn it off by default in Zing installations, and it 
appears that many other software packages (including most *DBs, and many 
Java-based apps) recommend the same.*
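For completeness, turning it off system-wide amounts to writing "never" 
into the two transparent_hugepage sysfs knobs (a sketch; it needs root, 
assumes the usual sysfs paths, and on most distros also needs to be made 
persistent across reboots, e.g. via the kernel boot command line):

/* Disable THP system-wide by writing "never" to the sysfs knobs. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = fputs(val, f) < 0 ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    int rc = 0;
    rc |= write_sysfs("/sys/kernel/mm/transparent_hugepage/enabled", "never");
    rc |= write_sysfs("/sys/kernel/mm/transparent_hugepage/defrag",  "never");
    return rc ? 1 : 0;
}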
