On Nov 14, 2008, at 10:56 AM, Eugene Loh wrote:
I too am interested - I think we need to do something about the sm
backing file situation, as machines with larger core counts are slated
to become more prevalent shortly.
I think there is at least one piece of low-hanging fruit: get rid of
a lot of the page alignments. Especially as one goes to large core
counts, the O(n^2) number of local "connections" becomes important,
and each connection starts with three page-aligned allocations, each
allocation very tiny (and hence using only a tiny portion of the page
that is allocated to it). So, most of the allocated memory is
never used.
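To make that concrete, here's a quick back-of-envelope sketch (my own,
not OMPI code) of how fast those per-connection allocations add up,
assuming 4KB pages and taking the three-allocations-per-connection
figure from above:

#include <stdio.h>

int main(void)
{
    const long page = 4096;           /* assumed page size */
    const long allocs_per_conn = 3;   /* per the discussion above */
    long n;

    for (n = 2; n <= 256; n *= 2) {
        long conns = n * (n - 1);     /* directed local connections */
        double mb = conns * allocs_per_conn * page / (1024.0 * 1024.0);
        printf("%4ld procs/node: %8.1f MB of page-aligned FIFO allocs\n",
               n, mb);
    }
    return 0;
}

At 256 processes on a node that's already roughly 765 MB of backing
file consumed by allocations that are mostly never touched.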
Personally, I question the rationale for the page alignment in the
first place, but don't mind listening to anyone who wants to explain
it to me. Presumably, in a NUMA machine, localizing FIFOs to
separate physical memory improves performance. I get that basic
premise. I just question the reasoning beyond that.
I think the original rationale was that only pages could be physically
pinned (not cache lines).
Slightly modifying Eugene's low-hanging fruit: it might be worthwhile
to figure out which processes are local to each other (e.g., on cores
on the same socket, where memory is local to all the cores on that
socket). These processes' data could be laid out contiguously (perhaps
even within a single page, depending on how many cores there are)
instead of on individual pages. Specifically: use page alignment for
groups of processes that have the same memory locality, as sketched
below.
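A hypothetical sketch of that layout (my own names and sizes, not the
actual OMPI allocator): reserve one page-aligned region per locality
group and hand out cacheline-aligned slots within it, rather than
page-aligning every structure individually.

#include <stddef.h>

#define CACHELINE 64   /* assumed cacheline size */

/* Round x up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* base: start of a page-aligned region reserved for one locality
 * group (e.g., all procs sharing a socket).  Hand out the i-th
 * cacheline-aligned slot of size sz within it, instead of
 * page-aligning each structure separately. */
static void *group_slot(void *base, size_t i, size_t sz)
{
    size_t stride = align_up(sz, CACHELINE);
    return (char *) base + i * stride;
}

The page-aligned region still gets whatever NUMA placement benefit
there is for the group as a whole, while the per-structure waste drops
from up to a page to at most a cacheline.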
The page alignment appears in ompi_fifo_init and ompi_cb_fifo_init;
it also comes from mca_mpool_sm_alloc. Four minor changes could
switch the alignment from page size to cacheline size.
What happens when there isn't enough memory to support all this?
Are we smart enough to detect this situation? Does the sm
subsystem quietly shut down? Warn and shut down? Segfault?
I'm not exactly sure. I think it's a combination of three things:
*) some attempt to signal problems correctly
*) some degree of just living with less shared memory (possibly
leading to performance degradation)
*) poorly tested in any case
I have two examples so far:
1. Using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
node, 2ppn, with btl=openib,sm,self. The program started, but
segfaulted on the first MPI_Send. No warnings were printed. (A sketch
of what I suspect happened here is below, after example 2.)
2. Again with a ramdisk, /tmp was reportedly set to 16MB
(unverified - some uncertainty; it could have been much larger).
OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
The program ran to completion without errors or warnings. I don't
know the communication pattern - it could be that no local
communication was performed, though that seems doubtful.
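My guess about example 1 (an assumption on my part, not something I've
traced through the sm code) is the usual tmpfs/mmap failure mode:
ftruncate'ing the backing file beyond what the filesystem can hold and
mmap'ing it both succeed, and the failure only shows up as a SIGBUS
(which looks like a segfault) when a page past the available space is
first touched. A minimal standalone sketch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 64 * 1024 * 1024;  /* larger than a 10MB /tmp */
    int fd = open("/tmp/sm_backing_test", O_CREAT | O_RDWR, 0600);
    char *p;

    if (fd < 0) { perror("open"); return 1; }

    /* Succeeds even on a 10MB tmpfs: the file is just sparse. */
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The first touch of a page that tmpfs cannot back raises SIGBUS;
     * nothing before this point reported any error. */
    memset(p, 0, len);

    printf("no failure -- /tmp had enough room after all\n");
    return 0;
}

If that is what's happening, detecting the shortfall up front would
mean either checking free space in the backing directory before sizing
the file, or pre-faulting the whole mapping at init time.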
--
Jeff Squyres
Cisco Systems