To highlight the problematic path in current THP kernel implementations, 
here is an example call trace that can happen (pulled from the discussion 
linked to below). It shows that a simple on-demand page fault in regular 
anonymous memory (e.g. when a normal malloc call is made and the allocator 
manipulates a malloc-managed 2MB area, or when the resulting malloc'ed 
struct is written to) can end up compacting an entire zone (which can be 
the vast majority of system memory) in a single call, using the faulting 
thread. The specific example stack trace is taken from a situation where 
that fault took so long (on a NUMA system) that a soft lockup was 
triggered, showing the call took longer than 22 seconds (!!!). But even 
without the NUMA or migrate_pages aspects, compaction of a single zone can 
take 100s of msec or more.
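
For concreteness, here is a minimal user-land sketch (hypothetical, not 
taken from that discussion) that exercises exactly this path: it times the 
first write to a freshly allocated, 2MB-aligned anonymous region. With THP 
enabled, that first touch can go through do_huge_pmd_anonymous_page() as in 
the trace below, and on a fragmented machine the measured time can include 
a synchronous compaction attempt:

/* Time the first-touch fault on a 2MB-aligned anonymous region. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION (2UL * 1024 * 1024)  /* one huge-page-sized region */

int main(void) {
    void *p;
    struct timespec t0, t1;
    if (posix_memalign(&p, REGION, REGION) != 0) {
        perror("posix_memalign");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(p, 1, 1);                /* first touch faults the region in */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("first-touch fault took %ld us\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L +
           (t1.tv_nsec - t0.tv_nsec) / 1000L);
    free(p);
    return 0;
}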

Browsing through the current kernel code 
(e.g. 
http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) 
seems to show that this is still the likely path that would be taken when 
no free 2MB pages are found in current kernels :-(

And this situation will naturally occur under all sorts of common timing 
conditions: e.g. i/o fragmenting free memory down to 4KB pages (leaving no 
free 2MB ones), background compaction/defrag falling behind during some 
heavy kernel-driven i/o spike, and some unlucky thread doing a malloc when 
the 2MB physical free list is exhausted.


kernel: Call Trace:
kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
kernel: [<ffffffff8160b808>] page_fault+0x28/0x30


On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:
>
> THP certainly sits in my "just don't do it" list of tuning things due to 
> its fundamental, dramatic latency disruption in current implementations, 
> seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on 
> something as simple and common as a 32 byte malloc. THP is a form of 
> in-kernel GC. And the current THP implementation involves potential and 
> occasional synchronous, stop-the-world compaction done at allocation-time, 
> on or by any application thread that does an mmap or a malloc.
>
> I dug up an e-mail I wrote on the subject (to a recipient on this list) 
> back in Jan 2013 [see below]. While it has some specific links (including a 
> stack trace showing the kernel de-fragging the whole system on a single 
> mmap call), note that this material is now 4.5 years old, and things 
> *might* have changed or improved to some degree. While I've seen no recent 
> first-hand evidence of efforts to improve things on the 
> don't-dramatically-stall-malloc (or other mappings) front, I haven't been 
> following it very closely (I just wrote it off as "let's check again in 5 
> years"). If someone else here knows of some actual improvements to this 
> picture in recent years, or of efforts or discussions in the Linux Kernel 
> community on this subject, please point to them.
>
> IMO, the notion of THP is not flawed. The implementation is. And I believe 
> that the implementation of THP *can* be improved to be much more robust and 
> to avoid forcing occasional huge latency artifacts on memory-allocating 
> threads:
>
> 1. The first (huge) step in improving things would be to 
> never-ever-ever-ever have a mapping thread spend any time performing any 
> kind of defragmentation, and to simply accept 4KB mappings when no 2MB 
> physical pages are available. Let background defragmentation do all the 
> work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB 
> mappings).
>
> 2. The second level (much needed, but at an order of magnitude of 10s of 
> milliseconds rather than the current 100s of msec or more) would be to make 
> background defragmentation work without stalling foreground access to a 
> currently-being-defragmented 2MB region. I.e. don't stall access for the 
> duration of a 2MB defrag operation (which can take several msec).
>
> While both of these are needed for a "don't worry about it" mode of use 
> (which something called "transparent" really should aim for), #1 is a much 
> easier step than #2 is. Without it, THP can cause application pauses (to 
> any linux app) that are often worse than e.g. HotSpot Java GC pauses. Which 
> is ironic.
>
> -- Gil. 
>
> -------------------------------
>
> The problem is not the background defrag operation. The problem is 
> synchronous defragging done on allocation, where having THP on means an 
> allocation will attempt to use a contiguous 2MB physical page, and if it can't 
> find one, it may end up defragging an entire zone before the allocation 
> completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only 
> controls the background...
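>
> As a quick way to see which of these knobs a given kernel exposes, here 
> is a minimal sketch (the paths are the documented sysfs ones; which of 
> them exist varies by kernel and distro):
>
> /* Dump the standard THP sysfs knobs, if present. */
> #include <stdio.h>
>
> static void show(const char *path) {
>     char buf[256];
>     FILE *f = fopen(path, "r");
>     if (!f) { printf("%s: (not present)\n", path); return; }
>     if (fgets(buf, sizeof buf, f))
>         printf("%s: %s", path, buf);
>     fclose(f);
> }
>
> int main(void) {
>     show("/sys/kernel/mm/transparent_hugepage/enabled");
>     show("/sys/kernel/mm/transparent_hugepage/defrag");
>     show("/sys/kernel/mm/transparent_hugepage/khugepaged/defrag");
>     return 0;
> }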
>
> Here is something I wrote up on it internally after much investigation:
>
> Transparent huge pages (THP) is a feature Red Hat championed and 
> introduced in RHEL 6.x, and got into the upstream kernel around the 
> ~2.6.38 timeframe; it generally exists in all Linux 3.x kernels and beyond (so 
> it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent 
> huge pages, the kernel *attempts* to use 2MB page mappings to map 
> contiguous and aligned memory ranges of that size (which are quite common 
> for many program scenarios), but will break those into 4KB mappings when 
> needed (e.g. cannot satisfy with 2MB pages, or when it needs to swap or 
> page out the memory, since paging is done 4KB at a time). With such a mixed 
> approach, some sort of a "defragmenter" or "compactor" is required to 
> exist, because without it simple fragmentation will (over time) make 2MB 
> contiguous physical pages a rare thing, and performance will tend to 
> degrade over time. As a result, and in order to support THP, Linux kernels 
> will attempt to defragment (or "compact") memory and memory zones. This can 
> be done either by unmapping pages, copying their contents to a new 
> compacted space, and mapping them in the new location, or by potentially 
> forcing individually mapped 4KB pages in a 2MB physical page out (via 
> swapping or by paging them out if they are file system pages), and 
> reclaiming the 2MB contiguous page when that is done. 4KB pages that were 
> forced out will come back in as needed (swapped back in on demand, or paged 
> back in on demand).
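>
> (As an aside: applications can opt individual mappings in or out of this 
> behavior. A minimal sketch, using the madvise(2) flags that shipped 
> alongside THP:)
>
> /* Opt an anonymous mapping out of THP with MADV_NOHUGEPAGE. */
> #include <stdio.h>
> #include <sys/mman.h>
>
> int main(void) {
>     size_t len = 4UL * 1024 * 1024;
>     void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (p == MAP_FAILED) { perror("mmap"); return 1; }
>     /* MADV_HUGEPAGE would request 2MB backing instead. */
>     if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
>         perror("madvise(MADV_NOHUGEPAGE)");
>     munmap(p, len);
>     return 0;
> }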
>
> Defragmentation/compaction with THP can happen in two places:
>
> 1. First, there is a background defragmenter (a process called 
> "khugepaged") that goes around and compacts 2MB physical pages by pushing 
> their 4KB pages out when possible. This background defragger could 
> potentially cause pages to be swapped out if swapping is enabled, even with 
> no swapping pressure in place.
>
> 2. "Synchronous Compaction": In some cases, an on demand page fault (e.g. 
> when first accessing a newly alloocated 4KB page created via mmap() or 
> malloc()) could end up trying to compact memory in order to fault into a 
> 2MB physical page instead of a 4KB page (this can be seen in the stack 
> trace discussed in this posting, for 
> example: https://access.redhat.com/solutions/1560893). When this happens, a 
> single 4KB allocation could end up waiting for an attempt to compact an 
> entire "zone" of pages, even if those are compacted purely through in-memory 
> moves with no I/O. It can also be blocked waiting for disk I/O, as seen on 
> some stack traces in related discussions. (See the sketch just below for a 
> way to check how much memory actually ended up THP-backed.)
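>
> To check this, one can sum the AnonHugePages fields in /proc/<pid>/smaps. 
> A minimal sketch, for the current process:
>
> /* Sum AnonHugePages across all mappings in /proc/self/smaps. */
> #include <stdio.h>
>
> int main(void) {
>     FILE *f = fopen("/proc/self/smaps", "r");
>     char line[512];
>     long kb, total_kb = 0;
>     if (!f) { perror("/proc/self/smaps"); return 1; }
>     while (fgets(line, sizeof line, f))
>         if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
>             total_kb += kb;
>     fclose(f);
>     printf("AnonHugePages total: %ld kB\n", total_kb);
>     return 0;
> }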
>
> More details can be found in places like this:
> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>
> And examples for cases of avoiding thrashing by disabling THP in RHEL 6.2 
> are around:
>
>
> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
>
>
> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>
> *BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps 
> compact physical memory and use more optimal mappings in the kernel, but it 
> can come with some significant (and often surprising) latency impacts. I 
> recommend we turn it off by default in Zing installations, and it appears 
> that many other software packages (including most **DBs, and many Java 
> based apps) recommend the same.*
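>
> Turning it off system-wide amounts to writing "never" into the sysfs 
> knobs shown earlier (as root, typically from an init script). A minimal 
> sketch of the same thing done programmatically:
>
> /* Disable THP system-wide (requires root). */
> #include <stdio.h>
>
> static void write_knob(const char *path, const char *val) {
>     FILE *f = fopen(path, "w");
>     if (!f) { perror(path); return; }
>     fputs(val, f);
>     fclose(f);
> }
>
> int main(void) {
>     write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "never");
>     write_knob("/sys/kernel/mm/transparent_hugepage/defrag", "never");
>     return 0;
> }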
>
