I agree completely. For me it's a no-brainer. I have missed countless nights of sleep, Thanksgiving dinners, weekends, and vacations because of buggy code. I don't complain; it comes with the territory. I've taken short-term consulting jobs where I worked close to 24 hours a day helping resolve critical outages while my vacationing family was in a swimming pool and I sat inside with a laptop. So I appreciate stability. THP is a great idea with a broken implementation.
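In case it's useful to anyone making the same call, here's a quick sketch for checking and flipping the setting. The sysfs path is the standard one on mainline kernels; the `thp_mode` helper is just my own convenience function, not anything the kernel provides:

```shell
# Show the active THP mode; the kernel brackets the selected value,
# e.g. "always madvise [never]".
# (RHEL 6 kernels use /sys/kernel/mm/redhat_transparent_hugepage instead.)
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || true

# Helper to extract just the bracketed token, handy in monitoring scripts.
thp_mode() { grep -o '\[[a-z]*\]' | tr -d '[]'; }

# Disable THP at runtime (root required; does not survive a reboot):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
# To make it permanent, boot with the kernel parameter:
#   transparent_hugepage=never
```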
Life is too short to deploy known-broken configurations. transparent_hugepage=never has worked well for me so far.

On Wed, Aug 23, 2017 at 1:51 PM, Tom Lee <[email protected]> wrote:

Peter, just want to say I've also seen very similar behavior with JVM heap sizes of ~16GB. I feel like I've seen multiple "failure" modes with THP, but most alarmingly we observed brief system-wide lockups in some cases, similar to those described in https://access.redhat.com/solutions/1560893. (I don't quite recall if we saw that exact "soft lockup" message, but I do recall something similar, and around the time we saw that message we also observed gaps in the output of a separate shell script that was periodically writing a message to a file every 5 seconds.)

I'm probably just scarred from the experience, but to me the question of whether to leave THP=always in such environments feels more like "do I want to gamble on this pathological behavior occurring?" than some dial for fine-tuning performance. Maybe it's better in more recent RHEL kernels, but I never really had a reason to roll the dice on it.

(This shouldn't scare folks off [non-transparent] hugepages entirely, though; I had much better results with those.)

On Wed, Aug 23, 2017 at 3:52 AM, Peter Booth <[email protected]> wrote:

Some points:

Those of us working in large corporate settings are likely to be running close to vanilla RHEL 7.3 or 6.9, with kernel versions 3.10.0-514 or 2.6.32-696 respectively.

I have seen the THP issue first hand, in dramatic fashion. One Java trading application I supported ran with heaps that ranged from 32GB to 64GB, running on Azul Zing, with no appreciable GC pauses. It was migrated from Westmere hardware on RHEL 5.6 to (faster) Ivy Bridge hardware on RHEL 6.4. In non-production environments only, the application suddenly began showing occasional pauses of up to a few seconds.
"Occasional" meaning only four or five out of 30 instances showed a pause, and they might only have one, two, or three pauses in a day. These instances ran a workload that replicated a production workload. I noticed that the only difference between these hosts and the healthy production hosts was that, due to human error, THP was disabled on the production hosts but not the non-prod hosts. As soon as we disabled THP on the non-prod hosts, the pauses disappeared.

This was a reactive discovery; I haven't done any proactive investigation of the effects of THP. It was sufficient for me to rule it out for today.

On Sunday, August 20, 2017 at 10:32:45 AM UTC-4, Alexandr Nikitin wrote:

Thank you for the feedback! Appreciate it. Yes, you are right. The intention was not to show that THP is an awesome feature, but to share techniques to measure and control the risks. I made the changes <https://github.com/alexandrnikitin/blog/compare/2139c405f0c50a3ab907fb2530421bf352caa412...3e58094386b14d19e06752d9faa0435be2cbe651> to highlight the purpose and risks.

The experiment is indeed interesting. I believe the "defer" option should help in that environment. I'm really keen to try the latest kernel (for reasons related not only to THP).

*Frankly, I still don't have a strong opinion about huge latency spikes in the allocation path in general. I'm not sure whether it's a THP issue or the application/environment itself. Likely it's high memory pressure in general that causes the spikes. Or the root of the issue is in something else, e.g. the jemalloc case.*

On Friday, August 18, 2017 at 6:32:40 PM UTC+3, Gil Tene wrote:

This is very well written and quite detailed. It has all the makings of a great post I'd point people to.
However, as currently stated, I'd worry that it would (mis)lead readers into using THP with the "always" /sys/kernel/mm/transparent_hugepage/defrag setting (instead of "defer"), and/or on older (pre-4.6) kernels, with a false sense that the many-msec slow-path allocation latency problems many people warn about don't actually exist. You do link to the discussions on the subject, but the measurements and summary conclusion of the posting alone would not end up warning people who don't actually follow those links.

I assume your intention is not to have the reader conclude that "there is lots of advice out there telling you to turn off THP, and it is wrong. Turning it on is perfectly safe, and may significantly speed up your application", but that you are instead aiming for something like "THP used to be problematic enough to cause wide-ranging recommendations to simply turn it off, but this has changed with recent Linux kernels. It is now safe to use in widely applicable ways (with the right settings) and can really help application performance without risking huge stalls". Unfortunately, I think that many readers would understand the current text as the former, not the latter.

Here is what I'd change to improve on the current text:

1. Highlight the risk of high slow-path allocation latencies with the "always" (and even "madvise") setting in /sys/kernel/mm/transparent_hugepage/defrag, the fact that the "defer" option is intended to address those risks, and that this defer option is available with Linux kernel versions 4.6 or later.

2. Create an environment that would actually demonstrate these very high (many msec or worse) latencies in the allocation slow path with defrag set to "always". This is the part that will probably take some extra work, but it will also be a very valuable contribution.
The issues are so widely reported (into the 100s of msec or more, and with a wide variety of workloads, as your links show) that intentional reproduction *should* be possible. And being able to demonstrate it actually happening will also allow you to demonstrate how newer kernels address it with the defer setting.

3. Show how changing the defrag setting to "defer" removes the high latencies seen by the allocation slow path under the same conditions.

For (2) above, I'd look to induce a situation where the allocation slow path can't find a free 2MB page without having to defragment one directly. E.g.:

- I'd start by significantly slowing down the background defragmentation in khugepaged (e.g. set /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 3600000). I'd avoid turning it off completely, in order to make sure you are still measuring the system in a configuration that believes it does background defragmentation.
- I'd add some static physical memory pressure (e.g. allocate and touch a bunch of anonymous memory in a process that would just sit on it) such that the system would only have 2-3GB free for buffers and your netty workload's heap. A sleeping JVM launched with an empirically sized and big enough -Xmx and -Xms, and with AlwaysPreTouch on, is an easy way to do that.
- I'd then create an intentional and spiky fragmentation load (e.g. perform spikes of scanning through a 20GB file every minute or so).
- With all that in place, I'd then repeatedly launch and run your netty workload without the PreTouch flag, in order to try to induce situations where an on-demand-allocated 2MB heap page hits the slow path, and the effect shows up in your netty latency measurements.
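In case it helps anyone trying this, the setup steps above might look roughly like the script below. This is only a sketch: the heap size, file path, and SleepForever class name are placeholders, the sysfs writes need root, and the later "defer" comparison assumes a 4.6+ kernel.

```shell
# 1. Slow khugepaged's background scan to once an hour (3600000 ms),
#    without turning background defragmentation off entirely.
echo 3600000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

# 2. Run the reproduction with direct (synchronous) compaction in the
#    allocation path; rerun later with "defer" (kernel 4.6+) to compare.
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# 3. Static memory pressure: a sleeping JVM that pre-touches a heap sized
#    (empirically) so only ~2-3GB stay free. SleepForever stands in for
#    any class that just parks the process.
java -Xms48g -Xmx48g -XX:+AlwaysPreTouch SleepForever &

# 4. Spiky fragmentation load: stream through a large file once a minute.
( while true; do cat /path/to/20g-file > /dev/null; sleep 60; done ) &

# 5. Repeatedly launch the netty workload WITHOUT -XX:+AlwaysPreTouch, so
#    2MB heap pages are faulted on demand and can hit the slow path.
```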
All the above are obviously experimentation starting points, and may take some iteration to actually induce the demonstrated high latencies we are looking for. But once you are able to demonstrate the impact of on-demand allocation doing direct (synchronous) compaction, both in your application latency measurements and in your kernel tracing data, you would then be able to try the same experiment with the defrag setting set to "defer", to show how newer kernels and this new setting now make it safe (or at least much safer) to use THP. And with that actually demonstrated, everything about THP recommendations for freeze-averse applications can change, making for a really great posting.

Sent from my iPad

On Aug 18, 2017, at 3:00 AM, Alexandr Nikitin <[email protected]> wrote:

I decided to write a post about measuring the performance impact (otherwise it stays in my messy notes forever). Any feedback is appreciated. https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/

On Saturday, August 12, 2017 at 1:01:31 PM UTC+3, Alexandr Nikitin wrote:

I played with Transparent Hugepages some time ago and I want to share some numbers based on real-world high-load applications. We have a JVM application: a high-load TCP server based on netty. There is no single clear bottleneck; CPU, memory, and network are all equally highly loaded. The amount of work depends on request content. The following numbers are based on normal server load, ~40% of the maximum number of requests one server can handle.

*When THP is off:*
End-to-end application latency in microseconds:
"p50" : 718.891,
"p95" : 4110.26,
"p99" : 7503.938,
"p999" : 15564.827,

perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
...
25,164,369 iTLB-load-misses
81,154,170 dTLB-load-misses
...

*When THP is always on:*
End-to-end application latency in microseconds:
"p50" : 601.196,
"p95" : 3260.494,
"p99" : 7104.526,
"p999" : 11872.642,

perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
...
21,400,513 dTLB-load-misses
4,633,644 iTLB-load-misses
...

As you can see, the THP performance impact is measurable and too significant to ignore: 4.1 ms vs 3.3 ms at p95, and roughly 106M vs 26M combined TLB misses. I also used SystemTap to measure a few kernel functions like collapse_huge_page, clear_huge_page, and split_huge_page. There were no significant spikes while using THP.

AFAIR that was a 3.10 kernel, which is 4 years old now. I can repeat the experiments with newer kernels if there's interest. (I don't know what was changed there, though.)

On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote:

Hi Everyone,

I'm failing to understand the problem with transparent huge pages.

I 'understand' how normal pages work. A page is typically 4KB in a virtual address space; each process has its own.

I understand how the TLB fits in: a cache providing a mapping of virtual to physical addresses, to speed up address translation.

I understand that using a large page, e.g. 2MB instead of 4KB, can reduce pressure on the TLB.

So far, so good: huge pages seem to make a lot of sense, of course at the expense of wasting memory if only a small section of a page is being used.

The first part I don't understand is: why is it called transparent huge pages? What is transparent about it?

The second part I'm failing to understand is: why can it cause problems?
There are quite a few applications that recommend disabling THP, and I recently helped a customer who was helped by disabling it. It seems there is more going on behind the scenes than an increased page size. Is it caused by fragmentation? That is, if a new page is needed and memory is fragmented (into smaller pages), do those small pages need to be compacted before a new huge page can be allocated? But if this were the only issue, it shouldn't be a problem once all pages for the application have been touched and are retained.

So I'm probably missing something simple.

--
You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/sljzehnCNZU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.

--
Tom Lee / http://tomlee.co / @tglee <http://twitter.com/tglee>
