Some points:

Those of us working in large corporate settings are likely to be running 
close to vanilla RHEL 7.3 or 6.9 with kernel versions 3.10.0-514 or 2.6.32-696 
respectively.

 I have seen the THP issue first hand, in dramatic fashion. One Java 
trading application I supported ran with heaps ranging from 32GB to 
64GB on Azul Zing, with no appreciable GC pauses. It was migrated from 
Westmere hardware on RHEL 5.6 to (faster) Ivy Bridge hardware on RHEL 6.4. 
In non-production environments only, the application suddenly began showing 
occasional pauses of up to a few seconds. "Occasional" meaning only 
four or five out of 30 instances showed a pause, and each might have 
only one to three pauses in a day. These instances ran a workload that 
replicated a production workload. I noticed that the only difference 
between these hosts and the healthy production hosts was that, due to 
human error, THP was disabled on the production hosts but not on the 
non-prod hosts. As soon as we disabled THP on the non-prod hosts, the 
pauses disappeared.
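
For context, disabling THP at runtime looks like the following (on some 
RHEL 6 builds the sysfs directory is redhat_transparent_hugepage rather 
than transparent_hugepage, so check which path exists on your hosts):

```shell
# Disable THP and its defragmentation at runtime. Takes effect
# immediately, but is not persistent across reboots -- add to
# rc.local or a tuned profile for that.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```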

This was a reactive discovery - I haven't done any proactive investigation 
of the effects of THP. But it was sufficient for me to rule THP out for now.

On Sunday, August 20, 2017 at 10:32:45 AM UTC-4, Alexandr Nikitin wrote:
>
> Thank you for the feedback! Appreciate it. Yes, you are right. The 
> intention was not to show that THP is an awesome feature but to share 
> techniques to measure and control risks. I made the changes 
> <https://github.com/alexandrnikitin/blog/compare/2139c405f0c50a3ab907fb2530421bf352caa412...3e58094386b14d19e06752d9faa0435be2cbe651> 
> to highlight the purpose and risks.
>
> The experiment is indeed interesting. I believe the "defer" option should 
> help in that environment. I'm really keen to try the latest kernel (related 
> not only to THP).
>
> *Frankly, I still don't have a strong opinion about huge latency spikes in 
> the allocation path in general. I'm not sure whether it's a THP issue or 
> the application/environment itself. Likely it's high memory pressure in 
> general that causes the spikes. Or the root of the issue is in something 
> else, e.g. the jemalloc case.*
>
>
> On Friday, August 18, 2017 at 6:32:40 PM UTC+3, Gil Tene wrote:
>>
>> This is very well written and quite detailed. It has all the makings of a 
>> great post I'd point people to. However, as currently stated, I'd worry 
>> that it would (mis)lead readers into using THP with 
>> /sys/kernel/mm/transparent_hugepage/defrag set to "always" (instead of 
>> "defer"), and/or on older (pre-4.6) kernels, with a false sense that the 
>> many-msec slow-path allocation latency problems many people warn about 
>> don't actually exist. You do link to the discussions on the subject, but 
>> the measurements and summary conclusion of the post alone would not 
>> warn readers who don't actually follow those links.
>>
>> I assume your intention is not to have the reader conclude that "there is 
>> lots of advice out there telling you to turn off THP, and it is wrong. 
>> Turning it on is perfectly safe, and may significantly speed up your 
>> application", but that you are instead aiming for something like "THP used 
>> to be problematic enough to cause wide-ranging recommendations to simply 
>> turn it off, but this has changed with recent Linux kernels. It is now safe 
>> to use in widely applicable ways (with the right settings) and can really 
>> help application performance without risking huge stalls". Unfortunately, I 
>> think that many readers would understand the current text as the former, 
>> not the latter.
>>
>> Here is what I'd change to improve on the current text:
>>
>> 1. Highlight the risk of high slow-path allocation latencies with the 
>> "always" (and even "madvise") setting in 
>> /sys/kernel/mm/transparent_hugepage/defrag, the fact that the "defer" 
>> option is intended to address those risks, and that this option is 
>> available in Linux kernel versions 4.6 and later.
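>>
>> A quick way to check what a given host supports (these are the standard 
>> sysfs locations; the active value is shown in brackets):
>>
>> ```shell
>> # The "defer" defrag option requires a 4.6+ kernel
>> uname -r
>>
>> # Current THP mode and defrag policy; the selected value is [bracketed]
>> cat /sys/kernel/mm/transparent_hugepage/enabled
>> cat /sys/kernel/mm/transparent_hugepage/defrag
>>
>> # Prefer deferred (background) defragmentation over synchronous
>> # direct compaction in the page-fault path (requires root)
>> echo defer > /sys/kernel/mm/transparent_hugepage/defrag
>> ```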
>>
>> 2. Create an environment that would actually demonstrate these very high 
>> (many-msec or worse) latencies in the allocation slow path with defrag set 
>> to "always". This is the part that will probably take some extra work, but 
>> it will also be a very valuable contribution. The issues are so widely 
>> reported (into the 100s of msec or more, and with a wide variety of 
>> workloads, as your links show) that intentional reproduction *should* be 
>> possible. And being able to demonstrate it actually happening will also 
>> allow you to demonstrate how newer kernels address it with the defer 
>> setting.
>>
>> 3. Show how changing the defrag setting to "defer" removes the high 
>> latencies seen by the allocation slow path under the same conditions.
>>
>> For (2) above, I'd look to induce a situation where the allocation slow 
>> path can't find a free 2MB page without having to defragment one directly. 
>> E.g.:
>> - I'd start by significantly slowing down the background defragmentation 
>> in khugepaged (e.g. set 
>> /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs 
>> to 3600000). I'd avoid turning it off completely in order to make sure you 
>> are still measuring the system in a configuration that believes it does 
>> background defragmentation.
>> - I'd add some static physical memory pressure (e.g. allocate and touch a 
>> bunch of anonymous memory in a process that would just sit on it) such that 
>> the system would only have 2-3GB free for buffers and your Netty workload's 
>> heap. A sleeping JVM launched with an empirically sized and big enough -Xmx 
>> and -Xms and with AlwaysPreTouch on is an easy way to do that.
>> - I'd then create an intentional and spiky fragmentation load (e.g. 
>> scanning through a 20GB file in a burst every minute or so).
>> - With all that in place, I'd then repeatedly launch and run your Netty 
>> workload without the PreTouch flag, in order to try to induce situations 
>> where an on-demand-allocated 2MB heap page hits the slow path, and the 
>> effect shows up in your Netty latency measurements.
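>>
>> The steps above could be sketched roughly like this (the sizes, the 
>> 20GB file path, and the SleepForever class are placeholders taken from 
>> the description, not tested values):
>>
>> ```shell
>> # 1. Slow background defragmentation way down (but don't turn it off,
>> #    so the system still believes it defragments in the background)
>> echo 3600000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>>
>> # 2. Pin down most of physical memory with a sleeping JVM so only
>> #    2-3GB remain free (size -Xmx/-Xms empirically for the host;
>> #    SleepForever is a trivial do-nothing class)
>> java -Xms60g -Xmx60g -XX:+AlwaysPreTouch SleepForever &
>>
>> # 3. Create spiky fragmentation pressure: scan a large file once a minute
>> while true; do cat /path/to/20GB.file > /dev/null; sleep 60; done &
>>
>> # 4. Repeatedly launch the Netty workload *without* AlwaysPreTouch, so
>> #    2MB heap pages are faulted in on demand and can hit the slow path
>> ```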
>>
>> All the above are obviously experimentation starting points, and may 
>> take some iteration to actually induce the demonstrated high latencies we 
>> are looking for. But once you are able to demonstrate the impact of 
>> on-demand allocation doing direct (synchronous) compaction both in your 
>> application latency measurement and in your kernel tracing data, you would 
>> then be able to try the same experiment with the defrag setting set to 
>> "defer" to show how newer kernels and this new setting now make it safe (or 
>> at least much more safe) to use THP. And with that actually demonstrated, 
>> everything about THP recommendations for freeze-averse applications can 
>> change, making for a really great posting.
>>
>> Sent from my iPad
>>
>> On Aug 18, 2017, at 3:00 AM, Alexandr Nikitin <[email protected]> 
>> wrote:
>>
>> I decided to write a post about measuring the performance impact 
>> (otherwise it stays in my messy notes forever). 
>> Any feedback is appreciated.
>>
>> https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/
>>
>> On Saturday, August 12, 2017 at 1:01:31 PM UTC+3, Alexandr Nikitin wrote: 
>>>
>>> I played with Transparent Hugepages some time ago and I want to share 
>>> some numbers based on real-world high-load applications.
>>> We have a JVM application: a high-load TCP server based on Netty. There 
>>> is no clear bottleneck: CPU, memory and network are all equally highly 
>>> loaded. The amount of work depends on request content.
>>> The following numbers are based on normal server load, ~40% of the 
>>> maximum number of requests one server can handle.
>>>
>>> *When THP is off:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 718.891,
>>> "p95" : 4110.26,
>>> "p99" : 7503.938,
>>> "p999" : 15564.827,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ...         25,164,369      iTLB-load-misses
>>> ...         81,154,170      dTLB-load-misses
>>> ...
>>>
>>> *When THP is always on:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 601.196,
>>> "p95" : 3260.494,
>>> "p99" : 7104.526,
>>> "p999" : 11872.642,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ...    21,400,513      dTLB-load-misses
>>> ...      4,633,644      iTLB-load-misses
>>> ...
>>>
>>> As you can see, the THP performance impact is measurable and too 
>>> significant to ignore: 4.1 ms vs 3.3 ms at p95, and ~106M vs ~26M TLB 
>>> misses per interval.
>>> I also used SystemTap to measure a few kernel functions like 
>>> collapse_huge_page, clear_huge_page and split_huge_page. There were no 
>>> significant spikes using THP.
>>> AFAIR that was a 3.10 kernel, which is 4 years old now. I can repeat the 
>>> experiments with newer kernels if there's interest. (I don't know what 
>>> changed there, though.)
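>>>
>>> A minimal SystemTap sketch for that kind of measurement (the exact 
>>> script is not preserved; function names are as in 3.10 kernels, and 
>>> kernel debuginfo must be installed):
>>>
>>> ```shell
>>> # Report wall time spent in clear_huge_page per call; swap in
>>> # collapse_huge_page or split_huge_page to probe those instead
>>> stap -e '
>>> global t
>>> probe kernel.function("clear_huge_page") {
>>>   t[tid()] = gettimeofday_us()
>>> }
>>> probe kernel.function("clear_huge_page").return {
>>>   if (tid() in t) {
>>>     printf("clear_huge_page: %d us\n", gettimeofday_us() - t[tid()])
>>>     delete t[tid()]
>>>   }
>>> }'
>>> ```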
>>>
>>>
>>> On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote: 
>>>>
>>>> Hi Everyone,
>>>>
>>>> I'm failing to understand the problem with transparent huge pages.
>>>>
>>>> I 'understand' how normal pages work. A page is typically 4kb in a 
>>>> virtual address space; each process has its own. 
>>>>
>>>> I understand how the TLB fits in; a cache providing a mapping of 
>>>> virtual to real addresses to speed up address conversion.
>>>>
>>>> I understand that using a large page, e.g. 2MB instead of a 4KB page, 
>>>> can reduce pressure on the TLB.
>>>>
>>>> So far it looks like huge pages make a lot of sense; of course at the 
>>>> expense of wasting memory if only a small section of a page is being 
>>>> used. 
>>>>
>>>> The first part I don't understand is: why is it called transparent huge 
>>>> pages? So what is transparent about it? 
>>>>
>>>> The second part I'm failing to understand is: why can it cause 
>>>> problems? There are quite a few applications that recommend disabling 
>>>> THP, and I recently helped a customer who was helped by disabling it. It 
>>>> seems there is more going on behind the scenes than just an increased 
>>>> page size. Is it caused by fragmentation? That is, if a new huge page is 
>>>> needed and memory is fragmented (due to smaller pages), those small pages 
>>>> need to be compacted before a new huge page can be allocated? But if this 
>>>> were the only thing, it shouldn't be a problem once all pages for the 
>>>> application have been touched and all pages are retained.
>>>>
>>>> So I'm probably missing something simple.
>>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
