Jeff Squyres wrote:
I think even first-touch will make *the whole page* be local to the
process that touches it.
Right.
So if you have each process take N bytes (where N << page_size) from the
same page, the 0th process will make that whole page local to itself; it
may be remote for the others.
I think I'm not making myself clear. Read on...
*) You wouldn't need to control memory allocations with a lock
(except for multithreaded apps). I haven't looked at this too
closely yet, but the 3*n*n memory allocations in shared memory
during MPI_Init are currently serialized, which sounds disturbing
when n is 100 to 500 local processes.
If I'm understanding your proposal right, you're saying that each
process would create its own shared memory space, right? Then any
other process that wants to send to that process would mmap/shmattach/
whatever to the receiver's shared memory space. Right?
I don't think it's necessary to have each process have its own segment.
The OS manages the shared area on a per-page basis anyhow. All that's
necessary is that there is an agreement up front about which pages will
be local to which process. E.g., if there are P processes/processors
and the shared area has M pages per process, then there will be P*M
pages altogether. We'll say that the first M pages are local to process
0, the next M to process 1, and so on. That is, process 0 will first-touch
the first M pages, process 1 will first-touch the next M pages, etc. If
an allocation needs to be local to process i, then process i will
allocate it from its pages. Since only process i can allocate from
these pages, it does not need any lock protection to keep other
processes from allocating at the same time. And, since these pages have
the proper locality, then small allocations can all share common pages
(instead of having a separate page for each 12-byte or 64-byte allocation).
Clearer? One shared memory region, partitioned equally among all
processes. Each process first-touches its own pages to get the right
locality. Each allocation is made by the process to which it should be
local. Benefits include no multi-process locks and no need for page
alignment of tiny allocations.
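To make this concrete, here is a rough sketch of the layout I have in
mind, assuming one mmap'd segment shared by everyone. The names, the
M=64 figure, and the bump-pointer allocator are just illustrations on my
part, not actual Open MPI code:

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define PAGES_PER_PROC 64   /* M: pages owned by each local process */

typedef struct {
    char   *base;   /* start of this process's partition           */
    size_t  used;   /* bump-pointer offset; no lock needed         */
    size_t  size;   /* M * page_size                               */
} local_pool_t;

/* Every process maps the same segment of P*M pages, so all
 * partitions are visible to all processes.                        */
static char *map_segment(int fd, int num_procs, size_t page_size)
{
    size_t len = (size_t) num_procs * PAGES_PER_PROC * page_size;
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Each process first-touches only its own M pages, so under the
 * default first-touch policy those pages end up local to it.      */
static void init_local_pool(local_pool_t *pool, char *seg_base,
                            int my_rank, size_t page_size)
{
    pool->size = PAGES_PER_PROC * page_size;
    pool->base = seg_base + (size_t) my_rank * pool->size;
    pool->used = 0;
    memset(pool->base, 0, pool->size);    /* the first touch */
}

/* Only the owning process allocates from its partition, so a plain
 * bump pointer suffices (no multi-process lock), and many small
 * allocations share pages instead of each taking a page.          */
static void *local_alloc(local_pool_t *pool, size_t nbytes)
{
    size_t aligned = (nbytes + 63) & ~(size_t) 63;   /* cacheline */
    if (pool->used + aligned > pool->size)
        return NULL;
    void *p = pool->base + pool->used;
    pool->used += aligned;
    return p;
}

Any process can still read and write structures allocated by another
process through the common mapping; only the allocation bookkeeping is
private to the owner, which is what removes the lock.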
The total amount of shared memory will likely not go down, because
the OS will still likely allocate on a per-page basis, right?
Total amount would go down significantly. Today, if you want to
allocate 64 bytes on a page boundary, you allocate 64+pagesize, a 100x
overhead. What I'm (evidently not so clearly) proposing is that we
establish a policy about what memory will be local to whom. With that
policy, we simply allocate our 64 bytes in the appropriate region. This
eliminates the need for page alignment (page is already in the right
place, shared by many allocations, all of which want to be there). You
could still want cacheline alignment... that's fine.
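To put rough numbers on it (assuming 4 KiB pages and taking the 3*n*n
64-byte allocations mentioned earlier as the workload; both figures are
just for illustration, not measurements):

#include <stdio.h>

int main(void)
{
    const long long page_size  = 4096;
    const long long alloc_size = 64;
    const long long n          = 512;        /* local processes       */
    const long long nallocs    = 3 * n * n;  /* MPI_Init allocations  */

    /* today: each allocation over-allocates by a page to guarantee
     * page alignment                                                 */
    long long aligned_bytes = nallocs * (alloc_size + page_size);
    /* proposed: allocations packed into the owner's pre-touched pages */
    long long pooled_bytes  = nallocs * alloc_size;

    printf("page-aligned today: %lld MiB\n", aligned_bytes >> 20);
    printf("pooled (proposed):  %lld MiB\n", pooled_bytes  >> 20);
    return 0;
}

With n around 512, that works out to roughly 3 GiB of over-allocation
today versus under 50 MiB for the same payload when small allocations
share pages.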
But per your 2nd point, would the resources required for each process
to mmap/shmattach/whatever 511 other processes' shared memory spaces
be prohibitive?
No need to have more shared memory segments. Just need a policy to say
how your global space is partitioned.
Graham, Richard L. wrote:
I have not looked at the code in a long time, so not sure how many
things have changed ... In general what you are suggesting is
reasonable. However, especially on large machines you also need to
worry about memory locality, so should allocate from memory pools
that are appropriately located. I expect that memory allocated on
a per-socket basis would do.
Is this what "maffinity" and "memory nodes" are about? If so, I
would think memory locality should be handled there rather than in
page alignment of individual 12-byte and 64-byte allocations.
maffinity was a first stab at memory affinity; it is currently (and has
been for a long, long time) no-frills, without a lot of thought put into
it.
I see the "node id" and "bind" functions in there; I think Gleb must
have added them somewhere along the way. I'm not sure how much
thought was put into making those be truly generic functions (I see
them implemented in libnuma, which AFAIK is Linux-specific). Does
Solaris have memory affinity function calls?
Yes, I believe so, though perhaps I don't understand your question.
Things like mbind() and numa_setlocal_memory() are, I assume, Linux
calls for placing some memory close to a process. I think the Solaris
madvise() call does this: give a memory range and say something about
how that memory should be placed -- e.g., the memory should be placed
local to the next thread to touch that memory. Anyhow, I think the
default policy is "first touch", so one could always do that.
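In case it helps, here is a tiny sketch of the Linux side using the
libnuma call mentioned above (link with -lnuma); the place_locally name
and the fallback to plain first touch are just my own illustration:

#include <numa.h>
#include <stddef.h>
#include <string.h>

/* Ask for the given range to be placed on the calling thread's node;
 * pages not yet faulted in will be allocated there on first touch.  */
static void place_locally(void *region, size_t len)
{
    if (numa_available() >= 0) {
        numa_setlocal_memory(region, len);
    } else {
        /* No libnuma support: touching the pages from the owning
         * process gives the same result under the default
         * first-touch policy.                                        */
        memset(region, 0, len);
    }
}

I have not checked what the exact Solaris madvise() spelling of that
would be.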
I'm not an expert on this stuff, but I just wanted to reassure you that
Solaris supports NUMA programming. There are interfaces for discovering
the NUMA topology of a machine (there is a hierarchy of "locality
groups", each containing CPUs and memory), for discovering in which
locality group you are, for advising the VM system where you want memory
placed, and for querying where certain memory is. I could do more
homework on these matters if it'd be helpful.