THP certainly sits in my "just don't do it" list of tuning things, due to its fundamentally dramatic latency disruption in current implementations, seen as occasional 10s-to-100s-of-msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC, and the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.

IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:

1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).

2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e., don't stall access for the duration of a 2MB defrag operation (which can take several msec).

While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2 is. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses.
Which is ironic.

-- Gil.

-------------------------------

The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...

Here is something I wrote up on it internally after much investigation:

Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x, and got into the upstream kernel around the ~2.6.38 time; it generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common in many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time).

With such a mixed approach, some sort of a "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade over time. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages), and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
Defragmentation/compaction with THP can happen in two places:

1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.

2. "Synchronous compaction": In some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: http://us.generation-nt.com/answer/patch-mm-thp-disable-defrag-page-faults-per-default-help-204214761.html). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen in some stack traces in related discussions.

More details can be found in places like these:
http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

And examples of avoiding thrashing by disabling THP in RHEL 6.2 are around:
http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html

BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts.
I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.