On Sunday, August 13, 2017 at 4:29:50 AM UTC-7, Alexandr Nikitin wrote:
>
> Regarding measurements: I understand that it's hard. In my case, the
> measurements were done on production servers under production load. The
> servers were not overloaded; they were at ~40% of capacity. Latency was
> gathered for a few dozen minutes. Kernel (khugepaged) function probing was
> done for a few hours (I think).
>
> What I didn't measure is the maximum throughput, the slow allocation and
> compaction path mentioned by Gil, page table size, and page-walking time. If
> anyone knows how to probe kernel page-walking time, it would be interesting
> to compare whether the page and table sizes affect it or not.
>
> It could be a good time to repeat the experiments. Please advise what and
> how to measure.

Since the question being asked is "Does THP cause any long stalls on your system?", I'd just run your measurement over several days and focus on the maximum. If you can trace stuff, trace try_to_compact_pages, since that's what the allocation path would be calling when the bad situations happen.
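For reference, here is a minimal ftrace sketch of the kind of tracing I mean. It assumes root and a tracefs mounted at /sys/kernel/debug/tracing (the mount point can differ by distro); the function names are the ones discussed in this thread:

```shell
# Configure the function_graph tracer to time try_to_compact_pages,
# the direct-compaction entry point taken in the page-fault path.
cd /sys/kernel/debug/tracing
echo try_to_compact_pages > set_graph_function
echo function_graph > current_tracer
# Let the workload run for a while, then inspect per-call durations:
grep try_to_compact_pages trace | tail
# Newer kernels also expose dedicated compaction tracepoints:
echo 1 > events/compaction/enable
```

Running this over days, as suggested above, and looking for long-duration calls is one way to answer the "does THP stall on this box?" question directly.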
> The stack trace example I posted earlier represents the path that will be
> taken if an on-demand allocation page fault on a THP-allocated region
> happens when no free 2MB page is available in the system.
>
> To be honest I thought that if THP fails to allocate a hugepage then it
> falls back to regular pages. I thought that khugepaged does the compaction
> logic (if the setting is not set to always). I see it in the docs:
> https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> "- if a hugepage allocation fails because of memory fragmentation,
> regular pages should be gracefully allocated instead and mixed in
> the same vma without any failure or significant delay and without
> userland noticing
> "

Yeh, that's what I thought it meant when I first read that stuff a few years ago too. Unfortunately, "fails because of memory fragmentation" [currently, in implementation] still means "fails if fragmentation is bad enough that it prevents defragmentation after trying to actually defragment" rather than "fails because fragmentation has left no currently free 2MB pages". The "regular pages should be gracefully allocated instead" part [unfortunately] only happens after an attempt to compact pages... It is "graceful" in the sense that it doesn't crash, but not in the sense that it doesn't [potentially] stall for a very long time. The stack trace in my earlier post shows the path of the allocation first trying to compact (with try_to_compact_pages) before falling "gracefully" back to satisfying the fault with 4KB page allocations.

It looks like newer kernels do support a "defer" option; see my separate posting (following this one) for that.

> The compaction/defrag phase can be addressed with its own flags:
>
> /sys/kernel/mm/transparent_hugepage/defrag
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The settings above only control the background khugepaged behavior.
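Concretely, the knobs look like this (a sketch; requires root, and the "defer" value only exists on kernels new enough to offer it):

```shell
# Show the current policies; the bracketed word is the active setting,
# e.g. "[always] madvise never".
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
# On kernels that support it, "defer" keeps THP on but pushes compaction
# out of the fault path and into the background (khugepaged/kswapd):
echo defer > /sys/kernel/mm/transparent_hugepage/defrag
# The conservative option discussed in this thread:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```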
khugepaged runs in the background, and you can use these settings to control its behavior and e.g. make it more aggressive. But your process's on-demand-allocation page faults (hitting an already mmapped, but not-yet-modified contiguous 2MB region that THP aims to satisfy by creating 2MB mappings) can always occur in between the sleeps that this background process does, and enough such faults can occur to outrun whatever khugepaged has produced. khugepaged does not evict filesystem cache and buffer pages, it only defragments them (they are movable, so it will shift them around to try to make the free memory show up in contiguous 2MB ranges), so the amount of free memory it gets to play with on a long-running system is often only a few hundred MB (/proc/sys/vm/min_free_kbytes can/should be increased, but usually folks won't set it to more than 1-2GB). Vigorous I/O paging activity [which generally happens in 4KB page units] will often fragment stuff quickly.

> I'm not a kernel expert though and I may be wrong. I'm really interested in
> whether those flags could solve or mitigate the freezes people mentioned
> here.

It looks like newer kernels do support a "defer" setting option for THP. That setting seems to avoid trying to compact memory in the allocation path. It may replace my general recommendation to set things to "never" [if you care about avoiding huge outliers] once I get to verify it in a few places.

> If that occasional outlier is something you are fine with, then turning THP
> on for the speed benefits you may be seeing makes sense. But if you can't
> accept the occasional ~0.5+ sec freezes, turn it off.
>
> I just wanted to show, for people who blindly follow advice on the Internet
> (and there are many such suggestions), that there's an impact. It can be
> noticeable and depends on setup and load.

And please keep doing that. Concrete results postings are useful. And the speed benefit you show in your application is quite compelling.
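For what it's worth, the min_free_kbytes tuning mentioned above is a one-liner (the 1GB value here is just an illustration, not a recommendation from this thread):

```shell
# Raise the kernel's free-memory floor so khugepaged/compaction have
# more contiguous room to work with. The value is in KB; 1GB shown here.
sysctl -w vm.min_free_kbytes=1048576
# To persist the setting across reboots:
echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.conf
```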
My motivation is quite similar, but my focus here was on highlighting the "if you want to avoid terrible outliers" thing in answering the original questions at the top of this thread. I see way too many "recommendations on the internet" based purely on speed, which ignore the outliers and other degenerate thrashing behaviors that may occur (infrequently, but far too often for some)...

> On Sunday, August 13, 2017 at 10:10:01 AM UTC+3, Gil Tene wrote:
>>
>> On Saturday, August 12, 2017 at 3:01:31 AM UTC-7, Alexandr Nikitin wrote:
>>>
>>> I played with Transparent Hugepages some time ago and I want to share
>>> some numbers based on real-world high-load applications.
>>> We have a JVM application: a high-load TCP server based on netty. No
>>> clear bottleneck; CPU, memory, and network are all equally highly loaded.
>>> The amount of work depends on request content.
>>> The following numbers are based on normal server load, ~40% of the
>>> maximum number of requests one server can handle.
>>>
>>> *When THP is off:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 718.891,
>>> "p95" : 4110.26,
>>> "p99" : 7503.938,
>>> "p999" : 15564.827,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ... 25,164,369 iTLB-load-misses
>>> ... 81,154,170 dTLB-load-misses
>>> ...
>>>
>>> *When THP is always on:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 601.196,
>>> "p95" : 3260.494,
>>> "p99" : 7104.526,
>>> "p999" : 11872.642,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ... 21,400,513 dTLB-load-misses
>>> ... 4,633,644 iTLB-load-misses
>>> ...
>>>
>>> As you can see, the THP performance impact is measurable and too
>>> significant to ignore: 4.1 ms vs 3.2 ms, and 100M vs 25M TLB misses.
>>> I also used SystemTap to measure a few kernel functions like
>>> collapse_huge_page, clear_huge_page, and split_huge_page. There were no
>>> significant spikes using THP.
>>> AFAIR that was a 3.10 kernel, which is 4 years old now. I can repeat the
>>> experiments with newer kernels if there's interest. (I don't know what
>>> was changed there though.)
>>
>> Unfortunately, just because you didn't run into a huge spike during your
>> test doesn't mean it won't hit you in the future... The stack trace example
>> I posted earlier represents the path that will be taken if an on-demand
>> allocation page fault on a THP-allocated region happens when no free 2MB
>> page is available in the system. Inducing that behavior is not that hard;
>> e.g. just do a bunch of high-volume journaling or logging, and you'll
>> probably trigger it eventually. And when it does take that path, that will
>> be your thread de-fragging the entire system's physical memory, one 2MB
>> page at a time.
>>
>> And when that happens, you're probably not talking 10-20 msec. More like
>> several hundreds of msec (growing with the system's physical memory size;
>> the specific stack trace is taken from a RHEL issue that reported >22
>> seconds). If that occasional outlier is something you are fine with, then
>> turning THP on for the speed benefits you may be seeing makes sense. But if
>> you can't accept the occasional ~0.5+ sec freezes, turn it off.
>>
>>> On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> I'm failing to understand the problem with transparent huge pages.
>>>>
>>>> I 'understand' how normal pages work. A page is typically 4kb in a
>>>> virtual address space; each process has its own.
>>>>
>>>> I understand how the TLB fits in; a cache providing a mapping of
>>>> virtual to real addresses to speed up address conversion.
>>>>
>>>> I understand that using a large page, e.g. 2mb instead of a 4kb page,
>>>> can reduce pressure on the TLB.
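The TLB-pressure arithmetic behind that last point can be sketched quickly (the 8GB heap size here is just an illustrative number):

```python
# Number of page mappings (and thus distinct TLB entries) needed to
# cover a heap, for 4KB pages vs 2MB transparent huge pages.
HEAP_BYTES = 8 * 2**30   # hypothetical 8GB JVM heap
SMALL_PAGE = 4 * 2**10   # 4KB
HUGE_PAGE = 2 * 2**20    # 2MB

small_pages = HEAP_BYTES // SMALL_PAGE
huge_pages = HEAP_BYTES // HUGE_PAGE
print(small_pages)                # 2097152 mappings with 4KB pages
print(huge_pages)                 # 4096 mappings with 2MB pages
print(small_pages // huge_pages)  # 512x fewer entries with huge pages
```

With hundreds of times fewer mappings to cover the same memory, far more of the working set fits in the TLB, which is exactly the iTLB/dTLB-miss reduction visible in the perf numbers above.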
>>>>
>>>> So far it looks like huge pages make a lot of sense; of course at the
>>>> expense of wasting memory if only a small section of a page is being
>>>> used.
>>>>
>>>> The first part I don't understand is: why is it called transparent huge
>>>> pages? What is transparent about it?
>>>>
>>>> The second part I'm failing to understand is: why can it cause
>>>> problems? There are quite a few applications that recommend disabling
>>>> THP, and I recently helped a customer that was helped by disabling it.
>>>> It seems there is more going on behind the scenes than having an
>>>> increased page size. Is it caused by fragmentation? So if a new page is
>>>> needed and memory is fragmented (due to smaller pages), small pages need
>>>> to be compacted before a new huge page can be allocated? But if this
>>>> were the only thing, it shouldn't be a problem once all pages for the
>>>> application have been touched and all pages are retained.
>>>>
>>>> So I'm probably missing something simple.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
