On Thu, 7 Jan 2010, Eugene Loh wrote:
Could someone tell me how these settings are used in OMPI or give any
guidance on how they should or should not be used?
This is a very good question :-) As is this whole e-mail, though it's hard
(in my opinion) to give it a Good (TM) answer.
This means that if you loop over the elements of multiple large arrays
(which is common in HPC), you can generate a lot of cache conflicts,
depending on the cache associativity.
On the other hand, high buffer alignment sometimes gives better
performance (e.g. InfiniBand QDR bandwidth).
There are multiple reasons one might want to modify the behavior of the
memory allocator, including the high cost of mmap calls, the need to
register memory for faster communications, and now this cache-conflict
issue. The usual solution is
setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ -1
or the equivalent mallopt() calls.
But yes, this pair of settings is the number one tweak for HPC codes that
I'm aware of.
This issue becomes an MPI issue for at least three reasons:
*) MPI may care about these settings due to memory registration and pinning.
(I invite you to explain to me what I mean. I'm talking over my head here.)
Avoiding mmap is good since it avoids calls to munmap (a function we
need to intercept to prevent data corruption).
*) (Related to the previous bullet), MPI performance comparisons may reflect
these effects. Specifically, in comparing performance of OMPI, Intel MPI,
Scali/Platform MPI, and MVAPICH2, some tests (such as HPCC and SPECmpi) have
shown large performance differences between the various MPIs when, it seems,
none were actually spending much time in MPI. Rather, some MPI
implementations were turning off large-malloc mmaps and getting good
performance (and sadly OMPI looked bad in comparison).
I don't think this bullet is related to the previous one. The first one is
a good reason; this one is typically the Bad reason. Bad, but
unfortunately true: competitors' MPI libraries are faster because ...
they do much more than MPI (accelerating malloc being the main difference).
Which I think is Bad, because all these settings should be left in the
developer's hands. You'll always find an application where these settings
waste memory (freed memory is never returned to the OS) and prevent the
application from running.
*) These settings seem to be desirable for HPC codes since they don't do
much allocation/deallocation and they do tend to have loop nests that wade
through multiple large arrays at once. For best "out of the box"
performance, a software stack should turn these settings on for HPC. Codes
don't typically identify themselves as "HPC", but some indicators include
Fortran, OpenMP, and MPI.
In practice, I agree. Most HPC codes benefit from it. But I have also run
into codes where the memory waste was a problem.
I don't know the full scope of the problem, but I've run into this with at
least HPCC STREAM (which shouldn't depend on MPI at all, but OMPI looks much
slower than Scali/Platform on some tests) and SPECmpi (primarily one or two
codes, though it depends also on problem size).
I also had those codes in mind. That's also why I don't like those MPI
"benchmarks": they benchmark much more than MPI, and hence encourage MPI
providers to incorporate into their libraries things that have (more or
less) nothing to do with MPI.
But again, yes, from the (basic) user's point of view, library X seems
faster than library Y. When there is nothing left to improve in MPI, start
optimizing the rest .. maybe we should reimplement a faster libc inside
MPI :-)
Sylvain