On Wed, Feb 14, 2018 at 2:26 PM, David Mathog <mat...@caltech.edu> wrote:
> Checked the hugepage settings and found a difference there.  The two systems
> that don't do this have  /sys/kernel/mm/redhat_transparent_hugepage/defrag
>
> always madvise [never]
>
> whereas the system with the issue has:
>
> [always] madvise never

THP defragmentation has definitely bitten us in the past on nodes
under memory pressure, and we now default to [madvise] pretty much
everywhere (we're too timid to disable it entirely).

A good way to see if that's really the issue is to run, as root,
"echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag" while
the problem is happening, and watch the affected processes with htop,
for instance.
The effect is usually instant: if the issue is really THP defrag,
CPU usage for your stalling process should drop right away and things
should go back to normal.
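
If it helps, here is a rough sketch of that check as a couple of shell
functions, so the previous setting is not lost when you flip it. The
function names (thp_mode, thp_try_never) and the THP_DEFRAG variable are
made up for illustration; the default path is the RHEL 6 one from this
thread, and on mainline kernels the file is
/sys/kernel/mm/transparent_hugepage/defrag instead. Writing to it needs
root.

```shell
#!/bin/sh
# Hypothetical helper, not an official tool. Default path is the one
# quoted in this thread; override THP_DEFRAG for other kernels.
THP_DEFRAG="${THP_DEFRAG:-/sys/kernel/mm/redhat_transparent_hugepage/defrag}"

# Print the active mode, i.e. the word shown inside [brackets].
thp_mode() {
    sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' "$1"
}

# Remember the current mode, switch to "never", and print how to
# restore the old value afterwards.
thp_try_never() {
    old=$(thp_mode "$1")
    echo never > "$1"
    echo "was: $old; restore with: echo $old > $1"
}
```

Usage would be something like "thp_try_never $THP_DEFRAG" in one
terminal while htop runs in another, then the printed echo command to
put the old setting back once you have your answer.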

Cheers,
--
Kilian
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
