> 
> On 12/12/08 8:21 PM, "Eugene Loh" <eugene....@sun.com> wrote:
> 
> Richard Graham wrote:
> The memory allocation is intended to take into account that two separate
> procs may be touching the same memory, so the intent is to reduce cache
> conflicts (false sharing)
> Got it.  I'm totally fine with that.  Separate cachelines.
> and put the memory close to the process that is using it.
> Problematic concept, but ... okay, I'll read on.
> When the code first went in, there was no explicit memory affinity
> implemented, so first-touch was relied on to get the memory in the
> "correct" location.
> 
> Okay.
> If I remember correctly, the head and the tail are each written by a
> different process, and are where the pointers and counters used to manage
> the FIFO are maintained.  They need to be close to the writer, and on
> separate cache lines, to avoid false sharing.
> Why close to the writer (versus reader)?
> 
> Anyhow, so far as I can tell, the 2-d structure ompi_fifo_t
> fifo[receiver][sender] is organized by receiver.  That is, the main
> ompi_fifo_t FIFO data structures are local to receivers.
> 
> But then, each FIFO is initialized (that is, circular buffers and
> associated allocations) by senders.  E.g.,
> https://svn.open-mpi.org/trac/ompi/browser/branches/v1.3/ompi/mca/btl/sm/btl_sm.c?version=19785#L537
> In the call to ompi_fifo_init(), all the circular buffer (CB) data
> structures are allocated by the sender.  On different cachelines -- even
> different pages -- but all by the sender.

It does not make a difference who allocates it; what makes a difference is
who touches it first.
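
To make the first-touch point concrete, here is a minimal sketch (not the
actual sm BTL code; the mmap setup, the names, and the 64-byte line size
are all assumptions) of how the page backing a FIFO's control words ends
up local to whichever process touches it first, with the head and tail
padded onto separate cache lines to avoid false sharing:

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define CACHE_LINE 64  /* assumed line size; real code would query it */

    /* Head and tail each get their own cache line, so the line the
     * reader polls is not the line the writer dirties. */
    typedef struct {
        volatile uint32_t head;
        char pad1[CACHE_LINE - sizeof(uint32_t)];
        volatile uint32_t tail;
        char pad2[CACHE_LINE - sizeof(uint32_t)];
    } fifo_ctrl_t;

    /* Under a first-touch NUMA policy, the physical page behind 'ctrl'
     * is placed on the node of whichever process faults it in first --
     * the mmap() does not decide placement, the memset() does. */
    fifo_ctrl_t *map_ctrl(void)
    {
        fifo_ctrl_t *ctrl = mmap(NULL, sizeof(*ctrl),
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (MAP_FAILED == ctrl) return NULL;
        memset(ctrl, 0, sizeof(*ctrl));  /* the touch that places the page */
        return ctrl;
    }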

> 
> Specifically, one accesses the FIFO on the receiver side, then follows
> pointers to the sender's side.  Doesn't matter if you're talking head,
> tail, or queue.
> The queue itself is accessed most often by the reader,
> You mean because it's polling often, but the writer writes only once?

Yes - it is polling volatile memory, so it has to load from memory on every
read.
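
For what it's worth, the reader loop looks roughly like this (a sketch
reusing the hypothetical fifo_ctrl_t above; wraparound handled naively).
The volatile keeps the compiler from hoisting the load out of the loop,
but whether each iteration actually reaches memory is up to the cache
hardware, which is exactly the question below:

    /* Reader side: spin until the writer advances the tail.  Each pass
     * reloads ctrl->tail because it is volatile; the line stays cached
     * until the writer's store invalidates it. */
    static void *fifo_read(fifo_ctrl_t *ctrl, void *volatile *slots,
                           uint32_t nslots)
    {
        uint32_t head = ctrl->head;
        while (ctrl->tail == head)
            ;                        /* poll */
        void *msg = slots[head % nslots];
        ctrl->head = head + 1;       /* consume; wraps via the modulo above */
        return msg;
    }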

> so it should be closer to the reader.
> Are there measurements to substantiate this?  Seems to me that in a
> cache-based system, a reader could poll on a remote location all it wanted
> and there'd be traffic only if the cached copy were invalidated.
> Conceivably, a transfer could go cache-to-cache and not hit memory at all.
> I tried some measurements and found no difference for any location --
> close to writer, close to reader, or far from both.
> I honestly don't remember much about the wrapper - would have to go back
> to the code to look at it.  If we no longer allow multiple FIFOs per pair,
> the wrapper layer can go away - it is there to manage multiple FIFOs per
> pair.
> 
> There is support for multiple circular buffers per FIFO.

The code is there, but I believe Gleb disabled the use of multiple FIFOs
and added a list to hold pending messages, so now we are paying two
overheads ...  I could be wrong here, but am pretty sure I am not.  I don't
know if George has touched the code since.
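
If that is right, the shape of the send path would be roughly the
following (a sketch with hypothetical names, not Gleb's actual code): when
the single circular buffer is full, the fragment is parked on a pending
list and retried later, so a full-buffer send pays for both the FIFO
bookkeeping and the list:

    #include <stddef.h>

    typedef struct frag { struct frag *next; } frag_t;

    typedef struct {
        frag_t *pending_head;   /* list of sends deferred on a full CB */
        frag_t *pending_tail;
        /* ... single circular buffer lives here ... */
    } fifo_t;

    enum { SENT, DEFERRED };

    /* Stub standing in for the single-CB enqueue: nonzero means full. */
    static int fifo_cb_write(fifo_t *f, frag_t *frag)
    {
        (void)f; (void)frag;
        return -1;
    }

    static int fifo_send(fifo_t *f, frag_t *frag)
    {
        if (fifo_cb_write(f, frag) != 0) {
            /* CB full: queue on the pending list -- the second overhead. */
            frag->next = NULL;
            if (f->pending_tail) f->pending_tail->next = frag;
            else                 f->pending_head = frag;
            f->pending_tail = frag;
            return DEFERRED;
        }
        return SENT;
    }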


> As far as granularity of allocation -- it needs to be large enough to
> accommodate the smallest level of the shared memory hierarchy, so I
> suppose in the most general case this may be the tertiary cache?
> 
> I don't get this.  I understand how certain things should be on separate
> cachelines.  Beyond that, we just figure out what should be local to a
> process and allocate all those things together.  That takes us from 3*n*n
> allocations (and pages) to just n of them.

Not sure what your point is here.  The cost per process is linear in the
total number of processes, so overall the cost scales as the number of
procs squared.  This was designed for small SMPs, to reduce coordination
costs between processes, and where memory costs are not large.  One can go
to very simple schemes that are constant with respect to memory footprint,
but then pay the cost of multiple writers to a single queue - this is what
LA-MPI did.
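
To put rough numbers on the scaling: with the fifo[receiver][sender]
layout, each of n processes holds n-1 receive FIFOs, so the job holds
n*(n-1) queues in total; at three separate allocations per FIFO (head,
tail, CB) that is where Eugene's 3*n*n figure comes from.  A throwaway
calculation:

    #include <stdio.h>

    int main(void)
    {
        /* Per-pair FIFOs: every receiver keeps one queue per sender. */
        for (int n = 2; n <= 128; n *= 2) {
            long fifos  = (long)n * (n - 1);  /* queues in the whole job */
            long allocs = 3 * fifos;          /* head, tail, CB separately */
            printf("n=%3d  fifos=%6ld  allocations=%7ld\n",
                   n, fifos, allocs);
        }
        return 0;
    }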

> No reason not to allocate objects that need to be associated with the same
> process on the same page, as long as one avoids false sharing.
> Got it.
> So it seems like each process could have all of its receive FIFOs on the
> same page, and these could also share the page with either the heads or
> the tails of each queue.

Yes, this makes sense.
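
As a sketch of that layout (sizes hypothetical, assuming a 4 KB page and
64-byte lines): all of one receiver's read-side control words grouped on a
single page, one cache line apiece, instead of one page per word per pair:

    #include <stdint.h>

    #define CACHE_LINE 64
    #define PAGE_SIZE  4096
    #define MAX_PEERS  (PAGE_SIZE / CACHE_LINE)  /* 64 peers per page */

    /* One page per receiver: the receive-side word for every sender
     * lives here, padded to its own line so no two peers' words
     * false-share. */
    typedef struct {
        struct {
            volatile uint32_t head;  /* read-side word for one peer */
            char pad[CACHE_LINE - sizeof(uint32_t)];
        } peer[MAX_PEERS];
    } recv_page_t;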

Rich

> 
> I will propose some specifics and run them by y'all.  I think I know
> enough to get started.  Thanks for the comments.
> 
> 
