On Wed, Feb 14, 2018 at 2:26 PM, David Mathog <mat...@caltech.edu> wrote:
> Checked the hugepage settings and found a difference there. The two systems
> that don't do this have /sys/kernel/mm/redhat_transparent_hugepage/defrag
>
> always madvise [never]
>
> whereas the system with the issue has:
>
> [always] madvise never

THP defragmentation has definitely bitten us in the past under memory pressure, and we now default to [madvise] pretty much everywhere (we're too timid to disable it entirely).

A good way to see if that's really the issue is to run

    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

while the problem is happening, and to watch the affected processes at the same time, with htop for instance. The effect is usually immediate: if the issue really is THP defrag, CPU usage for your stalling process should drop right away and things should go back to normal.

Cheers,
-- 
Kilian
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
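
As an aside, the bracket convention in those sysfs files (the active option is the one in square brackets) is easy to check from a script when comparing nodes. Below is a minimal Python sketch, not a tested tool: the function names active_setting and read_defrag are made up for illustration, and the default path is simply the one quoted in this thread (kernels that do not carry the Red Hat naming typically expose /sys/kernel/mm/transparent_hugepage/defrag instead, so the path is passed as a parameter).

```python
import re
from pathlib import Path

# Path as quoted in this thread; treat it as an assumption for your own
# systems, since other kernels use a different directory name.
DEFRAG_PATH = "/sys/kernel/mm/redhat_transparent_hugepage/defrag"


def active_setting(text):
    """Return the bracketed (active) option, e.g. 'always' for
    '[always] madvise never'."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed option found in {text!r}")
    return match.group(1)


def read_defrag(path=DEFRAG_PATH):
    """Read the sysfs file at `path` and return its active option."""
    return active_setting(Path(path).read_text())
```

On a node that has the file, print(read_defrag()) would show which of the listed options is currently in effect; active_setting can also be fed output collected from several nodes to spot the odd one out.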