On Sunday, August 13, 2017 at 4:29:50 AM UTC-7, Alexandr Nikitin wrote:
>
> Regarding measurements: I understand that it's hard. In my case, the 
> measurements were done on production servers under production load. The 
> servers were not overloaded; they were at ~40% of their capacity. Latencies 
> were gathered for a few dozen minutes. Kernel (khugepaged) function probing 
> ran for a few hours (I think).
> What I didn't measure is the maximum throughput, the slow allocation and 
> compaction path mentioned by Gil, page table size, and page walking time. If 
> anyone knows how to probe kernel page walking time, it would be 
> interesting to compare whether the page and table sizes affect it or not.
> It could be a good time to repeat the experiments. Please advise on what and 
> how to measure.
>
>
Since the question being asked is "Does THP cause any long stalls on your 
system?", I'd just run your measurement over several days, and focus on the 
maximum. If you can trace stuff, trace try_to_compact_pages, since that's 
what the allocation path would be calling when the bad situations happen.
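For example, one way to do that tracing (a sketch using the bcc `funclatency` tool; assumes bcc-tools is installed and that the function name hasn't changed on your kernel version):

```shell
# Trace how long try_to_compact_pages takes, in milliseconds, and print
# a latency histogram every 60 seconds. Leave this running for days and
# watch the highest buckets for multi-hundred-ms outliers.
sudo /usr/share/bcc/tools/funclatency -m -i 60 try_to_compact_pages
```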
 

> The stack trace example I posted earlier represents the path that will be 
> taken if an on-demand allocation page fault on a THP-allocated region happens 
> when no free 2MB page is available in the system.
>
>
> To be honest, I thought that if THP fails to allocate a hugepage then it falls 
> back to regular pages. I thought that khugepaged does the compaction logic 
> (if the setting is not "always"). I see it in the docs: 
> https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> "- if a hugepage allocation fails because of memory fragmentation,
>   regular pages should be gracefully allocated instead and mixed in
>   the same vma without any failure or significant delay and without
>   userland noticing
> "
>
>
Yeah, that's what I thought it meant when I first read that stuff a few 
years ago too. Unfortunately, "fails because of memory fragmentation" 
[currently, in implementation] still means "fails if fragmentation is bad 
enough that it prevents defragmentation after trying to actually 
defragment" rather than "fails because fragmentation has left no currently 
free 2MB pages". The "regular pages should be gracefully allocated instead" 
part [unfortunately] only happens after an attempt to compact pages... It 
is "graceful" in the sense that it doesn't crash, but not in the sense that 
it doesn't [potentially] stall for a very long time. The stack trace in my 
earlier post shows the path of the allocation first trying to compact (with 
try_to_compact_pages) before falling "gracefully" back to satisfying the 
fault with 4KB page allocations.
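Roughly, that path looks like this (a sketch with function names as in 3.x-era kernels; exact names and details vary by version):

```
page fault on a not-yet-backed THP region
  __do_page_fault
    handle_mm_fault
      do_huge_pmd_anonymous_page        # tries to get a free 2MB page
        __alloc_pages_slowpath          # no free 2MB page available
          try_to_compact_pages          # synchronous compaction: the stall
  ...on failure, the fault falls back to 4KB page allocations
```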

It looks like newer kernels do support a "defer" option, see separate 
posting (following this one) for that.
 

> The compaction/ defrag phase can be addressed with its own flags:
>
> /sys/kernel/mm/transparent_hugepage/defrag
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>
>
The settings above only control the background khugepaged behavior. 
khugepaged runs periodically, and you can use these settings to 
control its behavior and e.g. make it more aggressive. But your process's 
on-demand-allocation page faults (hitting an already mmapped, but 
not-yet-modified contiguous 2MB region that THP aims to satisfy by creating 
2MB mappings) can always occur in between the sleeps that this background 
process does, and enough such faults can occur to outrun whatever 
khugepaged has produced.
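To see the current pacing of that background process (paths per the stock THP sysfs layout; adjust if your distro lays things out differently):

```shell
# Dump the khugepaged pacing knobs, one "path:value" line each
grep . /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan \
       /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs \
       /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
```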

khugepaged does not evict filesystem cache and buffer pages; it only 
defragments them (they are movable, so it will shift them around to try to 
make the free memory show up in contiguous 2MB ranges), so the amount of 
free memory it gets to play with on a long-running system is often only a 
few hundred MB (/proc/sys/vm/min_free_kbytes can/should be increased, 
but usually folks won't set it to more than 1-2GB). Vigorous i/o paging 
activity [which generally happens in 4KB page units] will often fragment 
stuff quickly.  
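Checking and raising that reserve looks like this (value is in KB; 1GB is shown as an illustrative choice, not a recommendation for your workload):

```shell
# Show the current free-memory reserve the kernel tries to maintain
cat /proc/sys/vm/min_free_kbytes
# Raise it to ~1GB (1048576 KB) so compaction has more room to work with
echo 1048576 | sudo tee /proc/sys/vm/min_free_kbytes
```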
 

> I'm not a kernel expert though and I may be wrong. I'm really interested if 
> those flags could solve or mitigate the freezes people mentioned here.
>
>
It looks like newer kernels do support a "defer" option for the THP defrag 
setting. That setting seems to avoid trying to compact memory in the allocation 
path. It may replace my general recommendation to set things to "never" [if 
you care about avoiding huge outliers] once I get to verify it in a few 
places.
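To see whether your kernel offers it, and to select it (the active mode is shown in brackets; older kernels simply won't list "defer"):

```shell
# List the supported defrag modes; the one in [brackets] is active
cat /sys/kernel/mm/transparent_hugepage/defrag
# Select deferred (background-only) compaction, if the kernel offers it
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```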
 

>
> If that occasional outlier is something you are fine with, then turning THP 
> on for the speed benefits you may be seeing makes sense. But if you can't 
> accept the occasional ~0.5+ sec freezes, turn it off. 
>
>
> I just wanted to show for people who blindly follow advice on the Internet 
> (and there are many such suggestions) that there's an impact. It can be 
> noticeable and depends on setup and load.
>
>
And please keep doing that. Concrete results postings are useful. And the 
speed benefit you show in your application is quite compelling. My 
motivation is quite similar, but my focus here was on highlighting the "if 
you want to avoid terrible outliers" thing in answering the original 
questions at the top of this thread. I see way too many "recommendations on 
the internet" based purely on speed, which ignore the outliers and other 
degenerate thrashing behaviors that may occur (infrequently, but far too 
often for some)...
 

>
>
> On Sunday, August 13, 2017 at 10:10:01 AM UTC+3, Gil Tene wrote:
>>
>>
>>
>> On Saturday, August 12, 2017 at 3:01:31 AM UTC-7, Alexandr Nikitin wrote:
>>>
>>> I played with Transparent Hugepages some time ago and I want to share 
>>> some numbers based on real world high-load applications.
>>> We have a JVM application: high-load tcp server based on netty. No clear 
>>> bottleneck, CPU, memory and network are equally highly loaded. The amount 
>>> of work depends on request content.
>>> The following numbers are based on normal server load ~40% of maximum 
>>> number of requests one server can handle.
>>>
>>> *When THP is off:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 718.891,
>>> "p95" : 4110.26,
>>> "p99" : 7503.938,
>>> "p999" : 15564.827,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ...         25,164,369      iTLB-load-misses
>>> ...         81,154,170      dTLB-load-misses
>>> ...
>>>
>>> *When THP is always on:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 601.196,
>>> "p95" : 3260.494,
>>> "p99" : 7104.526,
>>> "p999" : 11872.642,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ...    21,400,513      dTLB-load-misses
>>> ...      4,633,644      iTLB-load-misses
>>> ...
>>>
>>> As you can see, the THP performance impact is measurable and too significant 
>>> to ignore: 4.1 ms vs 3.3 ms at p95, and ~106M vs ~26M TLB misses. 
>>> I also used SystemTap to measure a few kernel functions like 
>>> collapse_huge_page, clear_huge_page, and split_huge_page. There were no 
>>> significant spikes using THP. 
>>> AFAIR that was 3.10 kernel which is 4 years old now. I can repeat 
>>> experiments with the newer kernels if there's interest. (I don't know what 
>>> was changed there though)
>>>
>>
>> Unfortunately, just because you didn't run into a huge spike during your 
>> test doesn't mean it won't hit you in the future... The stack trace example 
>> I posted earlier represents the path that will be taken if an on-demand 
>> allocation page fault on a THP-allocated region happens when no free 2MB 
>> page is available in the system. Inducing that behavior is not that hard, 
>> e.g. just do a bunch of high volume journaling or logging, and you'll 
>> probably trigger it eventually. And when it does take that path, that will 
>> be your thread de-fragging the entire system's physical memory, one 2MB 
>> page at a time.
>>
>> And when that happens, you're probably not talking 10-20msec. More like 
>> several hundreds of msec (growing with the system physical memory size, the 
>> specific stack trace is taken from a RHEL issue that reported >22 seconds). 
>> If that occasional outlier is something you are fine with, then turning THP 
>> on for the speed benefits you may be seeing makes sense. But if you can't 
>> accept the occasional ~0.5+ sec freezes, turn it off. 
>>
>>
>>>
>>> On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> I'm failing to understand the problem with transparent huge pages.
>>>>
>>>> I 'understand' how normal pages work. A page is typically 4kb in a 
>>>> virtual address space; each process has its own. 
>>>>
>>>> I understand how the TLB fits in; a cache providing a mapping of 
>>>> virtual to real addresses to speed up address conversion.
>>>>
>>>> I understand that using a large page e.g. 2mb instead of a 4kb page can 
>>>> reduce pressure on the TLB.
>>>>
>>>> So far it looks like huge pages make a lot of sense; of course at the 
>>>> expense of wasting memory if only a small section of a page is being 
>>>> used. 
>>>>
>>>> The first part I don't understand is: why is it called transparent huge 
>>>> pages? So what is transparent about it? 
>>>>
>>>> The second part I'm failing to understand is: why can it cause 
>>>> problems? There are quite a few applications that recommend disabling THP, 
>>>> and I recently helped a customer that was helped by disabling it. It seems 
>>>> there is more going on behind the scenes than having an increased page 
>>>> size. Is it caused by fragmentation? So if a new page is needed and 
>>>> memory is fragmented (due to smaller pages), the small pages need to be 
>>>> compacted before a new huge page can be allocated? But if that were the 
>>>> only thing, it shouldn't be a problem once all pages for the application 
>>>> have been touched and all pages are retained.
>>>>
>>>> So I'm probably missing something simple.
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
