For shared-memory communications, each on-node connection (i.e., each
non-self sender-receiver pair) gets a circular buffer (CB) during
MPI_Init(). Each CB requires the following allocations:
*) ompi_cb_fifo_wrapper_t (roughly 64 bytes)
*) ompi_cb_fifo_ctl_t head (roughly 12 bytes)
*) ompi_cb_fifo_ctl_t tail (roughly 12 bytes)
*) queue (roughly 1024 bytes)
Importantly, the current code lays these four allocations out on three
separate pages. (The tail and queue are aggregated together.) So, for
example, that "head" allocation (12 bytes) ends up consuming a full page.
As one goes to more and more on-node processes -- say, for a large SMP
or a multicore system -- the number of non-self connections grows as
n*(n-1). So, these circular-buffer allocations end up consuming a lot
of shared memory.
For example, with a 4K page size and n=512 on-node processes, the
circular buffers consume about 3 Gbytes of shared memory -- roughly 90%
of which is empty and used only for page alignment.
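Here is the back-of-the-envelope math in code form (the 4K page, the
three page-aligned pieces per CB, and the byte counts are the rough
figures quoted above, not values pulled out of the source):

    #include <stdio.h>

    int main(void)
    {
        long n       = 512;                 /* on-node processes          */
        long page    = 4096;                /* page size in bytes         */
        long conns   = n * (n - 1);         /* non-self connections       */
        long per_cb  = 3 * page;            /* wrapper, head, tail+queue  */
        long payload = 64 + 12 + 12 + 1024; /* bytes actually used per CB */

        printf("connections:   %ld\n", conns);
        printf("shared memory: %.1f GiB\n",
               (double)(conns * per_cb) / (1024.0 * 1024.0 * 1024.0));
        printf("padding waste: %.0f%%\n",
               100.0 * (1.0 - (double)payload / (double)per_cb));
        return 0;
    }

For n=512 that prints about 3.0 GiB, with roughly 90% of it being
padding.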
I'd like to aggregate more of these allocations so that:
*) shared-memory consumption is reduced
*) the number of allocations (and hence the degree of lock contention)
during MPI_Init() is reduced
Any comments?
I'd like to understand the original rationale for these page
alignments. I expect this is related to memory placement of pages. So,
I imagine three scenarios. Which is it?
A) There really is a good reason for each allocation to have its own
page and any attempt to aggregate is doomed.
B) There is actual benefit for placing things carefully in memory, but
substantial aggregation is still possible. That is, for n processes, we
need at most n different allocations -- not 3*n*(n-1).
C) There is no actual justification for having everything on different
pages. That is, allowing different parts of a FIFO CB to be mapped
differently to physical memory sounded to someone like a good idea at
the time, but no one really did any performance measurements to justify
this. Or, if they did, it was only on one platform and we have no
evidence that the same behavior exists on all platforms. Personally,
I've run some simple experiments on at least one platform and found no
performance variation due to the placement of the shared variables that
two processes use to communicate. I guess it's possible that the data
moves cache-to-cache and doesn't care where the backing memory is.
Note that I only want to reduce the number of page-aligned allocations.
I'd preserve cacheline alignment. So, no worry about false sharing due
to a sender thrashing on one end of a FIFO and a receiver on the other.
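For what it's worth, here is a rough sketch of the kind of aggregated,
cacheline-padded per-connection layout I have in mind: one allocation
instead of three pages, with each piece padded to its own cache line so
the sender's and receiver's hot fields never share a line. The names
and sizes are purely illustrative (64-byte lines assumed), not the real
ompi_cb_fifo_* definitions:

    #include <stdio.h>

    #define CACHELINE     64
    #define CL_PAD(bytes) ((((bytes) + CACHELINE - 1) / CACHELINE) * CACHELINE)

    typedef struct {
        char wrapper[CL_PAD(64)];  /* ompi_cb_fifo_wrapper_t              */
        char head[CL_PAD(12)];     /* ompi_cb_fifo_ctl_t, receiver's line */
        char tail[CL_PAD(12)];     /* ompi_cb_fifo_ctl_t, sender's line   */
        char queue[CL_PAD(1024)];  /* the FIFO slots themselves           */
    } aggregated_cb_t;

    int main(void)
    {
        /* one cacheline-padded chunk per connection: ~1.2 Kbytes instead
         * of three 4 Kbyte pages, and no two writers share a cache line */
        printf("per-connection footprint: %zu bytes\n",
               sizeof(aggregated_cb_t));
        return 0;
    }

That comes out to 1216 bytes per connection, versus 12 Kbytes today.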