On Nov 14, 2008, at 10:56 AM, Eugene Loh wrote:
I too am interested - I think we need to do something about the sm
backing file situation, as machines with larger core counts are slated
to become more prevalent shortly.
I think there is at least one piece of low-hanging fruit: get rid of
a lot of the page alignments. Especially as one goes to large core
counts, the O(n^2) number of local "connections" becomes important,
and each connection starts with three page-aligned allocations, each
allocation very tiny (and hence using only a tiny portion of the page
that is allocated to it). So, most of the allocated memory is
never used.
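To make that concrete, here's a quick back-of-envelope sketch (my own,
not OMPI code) of how fast those per-connection allocations add up,
assuming 4KB pages and taking the three-allocations-per-connection
figure from above:

#include <stdio.h>

int main(void)
{
    const long page = 4096;           /* assumed page size */
    const long allocs_per_conn = 3;   /* per the discussion above */
    long n;

    for (n = 2; n <= 256; n *= 2) {
        long conns = n * (n - 1);     /* directed local connections */
        double mb = conns * allocs_per_conn * page / (1024.0 * 1024.0);
        printf("%4ld procs/node: %8.1f MB of page-aligned FIFO allocs\n",
               n, mb);
    }
    return 0;
}

At 256 processes on a node that's already roughly 765 MB of backing
file consumed by allocations that are mostly never touched.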
Personally, I question the rationale for the page alignment in the
first place, but don't mind listening to anyone who wants to explain
it to me. Presumably, in a NUMA machine, localizing FIFOs to
separate physical memory improves performance. I get that basic
premise. I just question the reasoning beyond that.
I think the original rationale was that only pages could be physically
pinned (not cache lines).
Slightly modifying Eugene's low-hanging fruit: it might be worthwhile
to figure out which processes are local to each other (e.g., on cores
on the same socket, where memory is local to all the cores on that
socket). These processes' data could be laid out contiguously (perhaps
even within a single page, depending on how many cores there are)
instead of on individual pages. Specifically: use page alignment for
groups of processes that have the same memory locality, as sketched
below.
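A hypothetical sketch of that layout (my own names and sizes, not the
actual OMPI allocator): reserve one page-aligned region per locality
group and hand out cacheline-aligned slots within it, rather than
page-aligning every structure individually.

#include <stddef.h>

#define CACHELINE 64   /* assumed cacheline size */

/* Round x up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* base: start of a page-aligned region reserved for one locality
 * group (e.g., all procs sharing a socket).  Hand out the i-th
 * cacheline-aligned slot of size sz within it, instead of
 * page-aligning each structure separately. */
static void *group_slot(void *base, size_t i, size_t sz)
{
    size_t stride = align_up(sz, CACHELINE);
    return (char *) base + i * stride;
}

The page-aligned region still gets whatever NUMA placement benefit
there is for the group as a whole, while the per-structure waste drops
from up to a page to at most a cacheline.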
The page alignment appears in ompi_fifo_init and ompi_cb_fifo_init;
it also comes from mca_mpool_sm_alloc. Four minor changes could
switch the alignment from page size to cacheline size.
What happens when there isn't enough memory to support all this?
Are we smart enough to detect this situation? Does the sm
subsystem quietly shut down? Warn and shut down? Segfault?
I'm not exactly sure. I think it's a combination of three things:
*) some attempt to signal problems correctly
*) some degree of just living with less shared memory (possibly
leading to performance degradation)
*) poorly tested in any case
I have two examples so far:
1. Using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
node, 2ppn, with btl=openib,sm,self. The program started, but
segfaulted on the first MPI_Send. No warnings were printed. (A sketch
of what I suspect happened here is below, after example 2.)
2. Again with a ramdisk, /tmp was reportedly set to 16MB
(unverified - some uncertainty; it could have been much larger).
OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
The program ran to completion without errors or warnings. I don't
know the communication pattern - it could be that no local
communication was performed, though that seems doubtful.
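My guess about example 1 (an assumption on my part, not something I've
traced through the sm code) is the usual tmpfs/mmap failure mode:
ftruncate'ing the backing file beyond what the filesystem can hold and
mmap'ing it both succeed, and the failure only shows up as a SIGBUS
(which looks like a segfault) when a page past the available space is
first touched. A minimal standalone sketch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 64 * 1024 * 1024;  /* larger than a 10MB /tmp */
    int fd = open("/tmp/sm_backing_test", O_CREAT | O_RDWR, 0600);
    char *p;

    if (fd < 0) { perror("open"); return 1; }

    /* Succeeds even on a 10MB tmpfs: the file is just sparse. */
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The first touch of a page that tmpfs cannot back raises SIGBUS;
     * nothing before this point reported any error. */
    memset(p, 0, len);

    printf("no failure -- /tmp had enough room after all\n");
    return 0;
}

If that is what's happening, detecting the shortfall up front would
mean either checking free space in the backing directory before sizing
the file, or pre-faulting the whole mapping at init time.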
--
Jeff Squyres
Cisco Systems