For the mailing list...
Note that we moved this conversation to a higher-bandwidth medium
(the telephone). If others are interested, please let us know.
On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
Jeff Squyres wrote:
I think even first-touch will make *the whole page* be local to
the process that touches it.
Right.
So if you have each process take N bytes (where N << page_size),
then the 0th process will make that whole page be local; it may be
remote for others.
I think I'm not making myself clear. Read on...
*) You wouldn't need to control memory allocations with a lock
(except for multithreaded apps). I haven't looked at this too
closely yet, but the 3*n*n memory allocations in shared memory
during MPI_Init are currently serialized, which sounds disturbing
when n is 100 to 500 local processes.
If I'm understanding your proposal right, you're saying that each
process would create its own shared memory space, right? Then any
other process that wants to send to that process would
mmap/shmattach/whatever to the receiver's shared memory space. Right?
I don't think it's necessary to have each process have its own
segment. The OS manages the shared area on a per-page basis
anyhow. All that's necessary is that there is an agreement up front
about which pages will be local to which process. E.g., if there
are P processes/processors and the shared area has M pages per
process, then there will be P*M pages altogether. We'll say that
the first M pages are local to process 0, the next M to process 1,
etc. That is, process 0 will first-touch the first M pages, process
1 will first-touch the next M pages, etc. If an allocation needs to
be local to process i, then process i will allocate it from its
pages. Since only process i can allocate from these pages, it does
not need any lock protection to keep other processes from allocating
at the same time. And, since these pages have the proper locality,
then small allocations can all share common pages (instead of having
a separate page for each 12-byte or 64-byte allocation).
Clearer? One shared memory region, partitioned equally among all
processes. Each process first-touches its own pages to get the
right locality. Each allocation is made by the process to which it
should be local. Benefits include no multi-process locks and no
need for page alignment of tiny allocations.
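If it helps, here is a bare-bones sketch of what I have in mind,
assuming a single backing file that one process has already created
and ftruncated to P*M pages; names like shm_partition_init and
shm_partition_alloc are just placeholders for illustration, not
anything in the tree:

/* Sketch: one shared region of P*M pages, with pages
 * [rank*M, (rank+1)*M) reserved for process `rank`. */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

typedef struct {
    char  *base;   /* start of this process's partition           */
    size_t size;   /* M * page_size bytes                         */
    size_t used;   /* bump pointer; only this process advances it */
} shm_partition_t;

/* Map the common backing file and claim this rank's slice of it. */
static int shm_partition_init(shm_partition_t *p, const char *file,
                              int rank, int nprocs, size_t pages_per_proc)
{
    size_t pgsz  = (size_t)sysconf(_SC_PAGESIZE);
    size_t slice = pages_per_proc * pgsz;
    int fd = open(file, O_RDWR);
    if (fd < 0) return -1;

    char *region = mmap(NULL, slice * (size_t)nprocs,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (region == MAP_FAILED) return -1;

    p->base = region + (size_t)rank * slice;
    p->size = slice;
    p->used = 0;

    /* First-touch our own pages so the OS places them local to us. */
    memset(p->base, 0, p->size);
    return 0;
}

/* No lock needed: only the owning process ever allocates from its
 * partition, so a bump pointer with cacheline alignment suffices.  */
static void *shm_partition_alloc(shm_partition_t *p, size_t len)
{
    size_t aligned = (p->used + 63) & ~(size_t)63;  /* 64-byte cachelines */
    if (aligned + len > p->size) return NULL;
    p->used = aligned + len;
    return p->base + aligned;
}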
The total amount of shared memory will likely not go down, because
the OS will still likely allocate on a per-page basis, right?
Total amount would go down significantly. Today, if you want to
allocate 64 bytes on a page boundary, you allocate 64+pagesize, a
100x overhead. What I'm (evidently not so clearly) proposing
is that we establish a policy about what memory will be local to
whom. With that policy, we simply allocate our 64 bytes in the
appropriate region. This eliminates the need for page alignment
(the page is already in the right place, shared by many allocations,
all of which want to be there). You could still want cacheline
alignment... that's fine.
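To put rough numbers on it (taking a 4 KB page for concreteness,
since the exact page size varies): padding a 64-byte allocation out
to a page boundary consumes 64 + 4096 bytes, about 65x the payload,
and closer to 130x with the 8 KB pages common on SPARC. Under the
partitioned scheme, 64 such allocations can share a single 4 KB page,
rounded up only to cacheline boundaries.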
But per your 2nd point, would the resources required for each
process to mmap/shmattach/whatever 511 other processes' shared
memory spaces be prohibitive?
No need to have more shared memory segments. Just need a policy to
say how your global space is partitioned.
Graham, Richard L. wrote:
I have not looked at the code in a long time, so not sure how
many things have changed ... In general what you are suggesting
is reasonable. However, especially on large machines you also
need to worry about memory locality, so should allocate from
memory pools that are appropriately located. I expect that
memory allocated on a per-socket basis would do.
Is this what "maffinity" and "memory nodes" are about? If so, I
would think memory locality should be handled there rather than
in page alignment of individual 12-byte and 64-byte allocations.
maffinity was a first stab at memory affinity; it is currently (and
has been for a long, long time) no-frills and didn't have a lot of
thought put into it.
I see the "node id" and "bind" functions in there; I think Gleb
must have added them somewhere along the way. I'm not sure how
much thought was put into making those be truly generic functions
(I see them implemented in libnuma, which AFAIK is Linux-
specific). Does Solaris have memory affinity function calls?
Yes, I believe so, though perhaps I don't understand your question.
Things like mbind() and numa_setlocal_memory() are, I assume, Linux
calls for placing some memory close to a process. I think the
Solaris madvise() call does this: give a memory range and say
something about how that memory should be placed -- e.g., the memory
should be placed local to the next thread to touch that memory.
Anyhow, I think the default policy is "first touch", so one could
always do that.
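For instance, a minimal sketch of the Solaris-style advice I mean
(MADV_ACCESS_LWP is the Solaris-specific flag, so I'm guarding it
with an #ifdef; other systems won't have it):

/* Sketch: ask the kernel to place a range local to the next LWP that
 * touches it.  MADV_ACCESS_LWP is Solaris-specific advice; on Linux
 * one would reach for mbind() or numa_setlocal_memory() from libnuma
 * instead. */
#include <sys/types.h>
#include <sys/mman.h>

static int place_near_next_toucher(void *addr, size_t len)
{
#ifdef MADV_ACCESS_LWP
    /* "The next LWP to touch this range will access it heavily." */
    return madvise((caddr_t)addr, len, MADV_ACCESS_LWP);
#else
    /* No such advice available: fall back to the default first-touch
     * policy by simply letting the owning process touch its pages.  */
    (void)addr; (void)len;
    return 0;
#endif
}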
I'm not an expert on this stuff, but I just wanted to reassure you
that Solaris supports NUMA programming. There are interfaces for
discovering the NUMA topology of a machine (there is a hierarchy of
"locality groups", each containing CPUs and memory), for discovering
in which locality group you are, for advising the VM system where
you want memory placed, and for querying where certain memory is. I
could do more homework on these matters if it'd be helpful.
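As a flavor of what's there, a bare-bones use of the locality-group
interfaces (liblgrp, so link with -llgrp) would look something like
the following; I'm writing this from memory, so treat it as a sketch
rather than gospel:

/* Sketch of the Solaris locality-group ("lgrp") discovery calls. */
#include <stdio.h>
#include <sys/lgrp_user.h>
#include <sys/procset.h>     /* P_LWPID, P_MYID */

int main(void)
{
    /* Snapshot of the lgroup hierarchy as visible to this process. */
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
    if (cookie == LGRP_COOKIE_NONE) return 1;

    /* Which locality group is "home" for the calling LWP?          */
    lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
    printf("home lgroup: %d (root: %d)\n",
           (int)home, (int)lgrp_root(cookie));

    lgrp_fini(cookie);
    return 0;
}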
--
Jeff Squyres
Cisco Systems