On Dec 10, 2008, at 1:11 PM, Eugene Loh wrote:

For shared memory communications, each on-node connection (non-self, sender-receiver pair) gets a circular buffer during MPI_Init(). Each CB requires the following allocations:

*) ompi_cb_fifo_wrapper_t (roughly 64 bytes)
*) ompi_cb_fifo_ctl_t head (roughly 12 bytes)
*) ompi_cb_fifo_ctl_t tail (roughly 12 bytes)
*) queue (roughly 1024 bytes)

Importantly, the current code lays these four allocations out on three separate pages. (The tail and queue are aggregated together.) So, for example, that "head" allocation (12 bytes) ends up consuming a full page.

As one goes to more and more on-node processes -- say, for a large SMP or a multicore system -- the number of non-self connections grows as n*(n-1). So, these circular-buffer allocations end up consuming a lot of shared memory.

For example, for a 4K pagesize and n=512 on-node processes, the circular buffers consume 3 Gbyte of memory -- 90% of which is empty and simply used for page alignment.
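As a back-of-the-envelope check (a standalone sketch assuming the 4 KB pagesize and the per-piece sizes quoted above -- not OMPI code), the figures work out like this:

#include <stdio.h>

int main(void)
{
    const long page  = 4096;            /* assumed pagesize            */
    const long n     = 512;             /* on-node processes           */
    const long conns = n * (n - 1);     /* non-self connections        */

    /* current layout: wrapper page + head page + (tail + queue) page */
    const long per_conn_pages = 3 * page;

    /* data actually stored per connection (sizes quoted above) */
    const long per_conn_data = 64 + 12 + 12 + 1024;

    double total_gb = (double)conns * per_conn_pages / (1 << 30);
    double padding  = 1.0 - (double)per_conn_data / per_conn_pages;

    printf("total: %.2f GB, padding: %.0f%%\n", total_gb, padding * 100.0);
    return 0;
}

That prints roughly "total: 2.99 GB, padding: 91%", which is where the 3 Gbyte / 90% numbers come from.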

I'd like to aggregate more of these allocations so that:

*) shared-memory consumption is reduced
*) the number of allocations (and hence the degree of lock contention) during MPI_Init is reduced

This certainly seems like a good idea to me.

I'd like to understand the original rationale for these page alignments. I expect this is related to memory placement of pages. So, I imagine three scenarios. Which is it?

A) There really is a good reason for each allocation to have its own page and any attempt to aggregate is doomed.

B) There is actual benefit for placing things carefully in memory, but substantial aggregation is still possible. That is, for n processes, we need at most n different allocations -- not 3*n*(n-1).

C) There is no actual justification for having everything on different pages. That is, allowing different parts of a FIFO CB to be mapped differently to physical memory sounded to someone like a good idea at the time, but no one really did any performance measurements to justify it. Or, if they did, it was only on one platform and we have no evidence that the same behavior exists on all platforms. Personally, I've played with some simple experiments on one (or more?) platforms and found no performance variations due to placement of the shared variables that two processes use for communication (a sketch of that kind of experiment follows below). I guess it's possible that data is moving cache-to-cache and doesn't care where the backing memory is.
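For what it's worth, here is a minimal sketch of the kind of placement experiment described in (C) -- an assumed setup (POSIX shared memory, two processes ping-ponging through flag variables whose relative placement is controlled by PAD), not OMPI code:

#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define PAD   4096     /* 64 = same page, separate cachelines; 4096 = separate pages */
#define ITERS 1000000

int main(void)
{
    char *shm = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shm == MAP_FAILED) return 1;

    _Atomic int *ping = (_Atomic int *)shm;          /* written by parent */
    _Atomic int *pong = (_Atomic int *)(shm + PAD);  /* written by child  */

    if (fork() == 0) {                        /* child: echo ping back as pong */
        for (int i = 1; i <= ITERS; i++) {
            while (atomic_load(ping) != i) ;  /* spin */
            atomic_store(pong, i);
        }
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 1; i <= ITERS; i++) {        /* parent: send ping, wait for pong */
        atomic_store(ping, i);
        while (atomic_load(pong) != i) ;      /* spin */
    }
    gettimeofday(&t1, NULL);
    wait(NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("round trip: %.1f ns\n", us * 1000.0 / ITERS);
    return 0;
}

Comparing the round-trip time with the two flags on the same page versus separate pages is the sort of measurement that would settle whether the page placement matters.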

Note that I only want to reduce the number of page-aligned allocations. I'd preserve cacheline alignment. So, no worry about false sharing due to a sender thrashing on one end of a FIFO and a receiver on the other.
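To make the aggregation idea concrete, here is a hypothetical layout sketch (these are not the actual OMPI data structures): pack the wrapper, head, tail, and queue for one connection into a single chunk, padding each piece only to a cacheline boundary instead of a page.

#include <stddef.h>

#define CACHELINE 64   /* assumed cacheline size */

/* round x up to the next multiple of a (a must be a power of two) */
static size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* sizes quoted earlier in this thread */
enum { WRAPPER_SZ = 64, CTL_SZ = 12, QUEUE_SZ = 1024 };

/* offsets of the pieces inside one aggregated per-connection chunk;
 * each piece starts on its own cacheline, so dropping the page
 * alignment introduces no false sharing between sender and receiver. */
static size_t chunk_size(size_t *head_off, size_t *tail_off, size_t *queue_off)
{
    size_t off = align_up(WRAPPER_SZ, CACHELINE);            /* wrapper at 0 */
    *head_off  = off; off += align_up(CTL_SZ, CACHELINE);    /* head ctl     */
    *tail_off  = off; off += align_up(CTL_SZ, CACHELINE);    /* tail ctl     */
    *queue_off = off; off += align_up(QUEUE_SZ, CACHELINE);  /* queue        */
    return off;   /* 1216 bytes instead of three pages (12 KB) */
}

At roughly 1.2 KB per connection instead of 12 KB, the n=512 example above would drop from about 3 GB to about 300 MB, with each piece still starting on its own cacheline.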

I thought that Rich had prior data from other architectures to justify doing it this way, but I don't know that for a fact...

I *think* the rationale was that with a very busy producer/consumer pair, the two could operate independently without affecting each other: since they were operating on different pages, writes to one page would not affect reads from the other. Perhaps this has not worked out in practice...?

--
Jeff Squyres
Cisco Systems
