THP certainly sits in my "just don't do it" list of tuning things due to 
its fundamental, dramatic latency disruption in current implementations, 
seen as occasional 10s-to-100s-of-msec (and sometimes even 1sec+) stalls 
on something as simple and common as a 32 byte malloc. THP is a form of 
in-kernel GC, and the current THP implementation involves potential, 
occasional synchronous, stop-the-world compaction done at allocation 
time, on or by any application thread that does an mmap or a malloc.
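As a concrete illustration: a process that wants to opt itself out of THP 
(rather than tuning the whole machine) can do so via 
prctl(PR_SET_THP_DISABLE) (3.15+ kernels) or madvise(MADV_NOHUGEPAGE) on 
specific ranges. A minimal sketch, assuming a glibc/kernel combination new 
enough to expose both:

/* Per-process / per-range THP opt-out sketch. Assumes Linux 3.15+ for
 * PR_SET_THP_DISABLE; MADV_NOHUGEPAGE has been around since 2.6.38. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
    /* Disable THP for all future mappings of this process. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_THP_DISABLE)");

    /* Alternatively, opt out a specific range only. */
    size_t len = 64 * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
        perror("madvise(MADV_NOHUGEPAGE)");

    memset(p, 0, len);   /* faults now take the plain 4KB path */
    munmap(p, len);
    return 0;
}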

I dug up an e-mail I wrote on the subject (to a recipient on this list) 
back in Jan 2013. While it has some specific links (including a stack trace 
showing the kernel de-fragging the whole system on a single mmap call), 
note that this material is now 4.5 years old, and things *might* have 
changed or improved to some degree. While I've seen no recent first-hand 
evidence of efforts to improve things on the 
don't-dramatically-stall-malloc (or other mappings) front, I haven't been 
following it very closely (I just wrote it off as "let's check again in 5 
years"). If someone else here knows of some actual improvements to this 
picture in recent years, or of efforts or discussions in the Linux Kernel 
community on this subject, please point to them.

IMO, the notion of THP is not flawed. The implementation is. And I believe 
that the implementation of THP *can* be improved to be much more robust and 
to avoid forcing occasional huge latency artifacts on memory-allocating 
threads:

1. The first (huge) step in improving things would be to 
never-ever-ever-ever have a mapping thread spend any time performing any 
kind of defragmentation, and to simply accept 4KB mappings when no 2MB 
physical pages are available. Let background defragmentation do all the 
work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB 
mappings).

2. The second level (much needed, but at an order of magnitude of 10s of 
milliseconds rather than the current 100s of msec or more) would be to make 
background defragmentation work without stalling foreground access to a 
currently-being-defragmented 2MB region. I.e. don't stall access for the 
duration of a 2MB defrag operation (which can take several msec).

While both of these are needed for a "don't worry about it" mode of use 
(which something called "transparent" really should aim for), #1 is a much 
easier step than #2 is. Without it, THP can cause application pauses (to 
any Linux app) that are often worse than e.g. HotSpot Java GC pauses. 
Which is ironic.

-- Gil. 

-------------------------------

The problem is not the background defrag operation. The problem is 
synchronous defragging done on allocation: with THP on, a 2MB allocation 
will attempt to use a 2MB contiguous physical page, and if it can't find 
one, it may end up defragging an entire zone before the allocation 
completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only 
controls the background...
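One rough way to observe this (a sketch, assuming nothing beyond a stock 
Linux toolchain) is to time the first touch of each 2MB chunk of a freshly 
mmap'd anonymous region; on a fragmented machine with defrag-on-fault 
enabled, the outliers here are exactly the allocation-time stalls being 
described:

/* Time first-touch faults across a fresh anonymous mapping and report
 * any that take longer than 1ms. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    const size_t chunk = 2 * 1024 * 1024;     /* THP granule */
    const size_t len   = 256 * chunk;         /* 512MB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t off = 0; off < len; off += chunk) {
        uint64_t t0 = now_ns();
        p[off] = 1;                           /* first-touch fault */
        uint64_t dt = now_ns() - t0;
        if (dt > 1000000)                     /* report faults > 1ms */
            printf("fault at %zu MB took %.2f ms\n",
                   off >> 20, dt / 1e6);
    }
    munmap(p, len);
    return 0;
}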

Here is something I wrote up on it internally after much investigation:

Transparent huge pages (THP) is a feature Red Hat championed and 
introduced in RHEL 6.x; it got into the upstream kernel around ~2.6.38 and 
generally exists in all Linux 3.x kernels and beyond (so it exists in both 
SLES 11 SP2 and Ubuntu 12.04 LTS). With transparent huge pages, the kernel 
*attempts* to use 2MB page mappings to map contiguous and aligned memory 
ranges of that size (which are quite common in many program scenarios), 
but will break those into 4KB mappings when needed (e.g. when it cannot 
satisfy the allocation with 2MB pages, or when it needs to swap or page 
out the memory, since paging is done 4KB at a time). With such a mixed 
approach, some sort of "defragmenter" or "compactor" is required, because 
without one simple fragmentation will (over time) make 2MB contiguous 
physical pages a rare thing, and performance will tend to degrade. As a 
result, and in order to support THP, Linux kernels will attempt to 
defragment (or "compact") memory and memory zones. This can be done either 
by unmapping pages, copying their contents to a new compacted space, and 
mapping them in the new location, or by forcing individually mapped 4KB 
pages in a 2MB physical page out (via swapping, or by paging them out if 
they are file system pages), and reclaiming the 2MB contiguous page when 
that is done. 4KB pages that were forced out will come back in as needed 
(swapped back in on demand, or paged back in on demand).
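To check whether a given range actually ended up backed by 2MB pages or 
fell back to 4KB mappings, something like the following sketch works (it 
assumes /proc/self/smaps_rollup, present in newer kernels; on older ones, 
grep AnonHugePages in /proc/self/smaps instead):

/* Hint a 64MB anonymous mapping as huge-page eligible, touch it, then
 * print the AnonHugePages totals so you can see whether the kernel used
 * 2MB pages or fell back to 4KB mappings. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask for THP on this range (matters when enabled=madvise). */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(p, 1, len);   /* fault the pages in */

    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strstr(line, "AnonHugePages"))
            fputs(line, stdout);   /* kB actually mapped with 2MB pages */
    fclose(f);
    munmap(p, len);
    return 0;
}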

Defragmentation/compaction with THP can happen in two places:

1. First, there is a background defragmenter (a process called 
"khugepaged") that goes around and compacts 2MB physical pages by pushing 
their 4KB pages out when possible. This background defragger could 
potentially cause pages to be swapped out if swapping is enabled, even 
with no swap pressure in place.

2. "Synchronous Compaction": In some cases, an on demand page fault (e.g. 
when first accessing a newly alloocated 4KB page created via mmap() or 
malloc()) could end up trying to compact memory in order to fault into a 
2MB physical page instead of a 4KB page (this can be seen in the stack 
trace discussed in this posting, for example: 
http://us.generation-nt.com/answer/patch-mm-thp-disable-defrag-page-faults-per-default-help-204214761.html).
 
When this happens, a single 4KB allocation could end up waiting for an 
attempt to compact an entire "zone" of pages, even if those are compacted 
purely thru in-memory moves with no I/O. It can also be blocked waiting for 
disk I/O as seen on some stack traces in related discussions.
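A sketch for watching which of the two paths is being exercised, via the 
THP and compaction counters in /proc/vmstat (counter names vary a bit 
across kernel versions, so treat the list as illustrative): 
thp_fault_alloc / thp_fault_fallback count page faults that got (or failed 
to get) a 2MB page, and compact_stall counts entries into direct, 
synchronous compaction:

/* Dump the THP/compaction counters of interest from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *keys[] = { "thp_fault_alloc", "thp_fault_fallback",
                           "thp_collapse_alloc", "compact_stall",
                           "compact_success", "compact_fail" };
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("fopen /proc/vmstat"); return 1; }
    char line[128];
    while (fgets(line, sizeof line, f))
        for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
            if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                fputs(line, stdout);
    fclose(f);
    return 0;
}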

More details can be found in places like this:
http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

And examples of avoiding thrashing by disabling THP on RHEL 6.2 are 
around:

http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads

http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html

*BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps 
compact physical memory and use more optimal mappings in the kernel, but 
it can come with some significant (and often surprising) latency impacts. 
I recommend we turn it off by default in Zing installations, and it 
appears that many other software packages (including most *DBs, and many 
Java-based apps) recommend the same.*
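For completeness, turning it off system-wide amounts to writing "never" 
into the two transparent_hugepage sysfs knobs (a sketch; it needs root, 
assumes the usual sysfs paths, and on most distros also needs to be made 
persistent across reboots, e.g. via the kernel boot command line):

/* Disable THP system-wide by writing "never" to the sysfs knobs. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = fputs(val, f) < 0 ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    int rc = 0;
    rc |= write_sysfs("/sys/kernel/mm/transparent_hugepage/enabled", "never");
    rc |= write_sysfs("/sys/kernel/mm/transparent_hugepage/defrag",  "never");
    return rc ? 1 : 0;
}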
