Jeff Squyres wrote:

I think even first-touch will make *the whole page* be local to the process that touches it.

Right.

So if you have each process take N bytes (where N << page_size), then the 0th process will make that whole page be local; it may be remote for others.

I think I'm not making myself clear.  Read on...

*) You wouldn't need to control memory allocations with a lock (except for multithreaded apps). I haven't looked at this too closely yet, but the 3*n*n memory allocations in shared memory during MPI_Init are currently serialized, which sounds disturbing when n is 100 to 500 local processes.

If I'm understanding your proposal right, you're saying that each process would create its own shared memory space, right? Then any other process that wants to send to that process would mmap/shmattach/whatever to the receiver's shared memory space. Right?

I don't think it's necessary to have each process have its own segment. The OS manages the shared area on a per-page basis anyhow. All that's necessary is that there is an agreement up front about which pages will be local to which process. E.g., if there are P processes/processors and the shared area has M pages per process, then there will be P*M pages altogether. We'll say that the first M pages are local to process 0, the next M to process 1, etc. That is, process 0 will first-touch the first M pages, process 1 will first-touch the next M pages, etc. If an allocation needs to be local to process i, then process i will allocate it from its pages. Since only process i can allocate from these pages, it does not need any lock protection to keep other processes from allocating at the same time. And, since these pages have the proper locality, then small allocations can all share common pages (instead of having a separate page for each 12-byte or 64-byte allocation).

Clearer? One shared memory region, partitioned equally among all processes. Each process first-touches its own pages to get the right locality. Each allocation is made by the process to which it should be local. Benefits include no multi-process locks and no need for page alignment of tiny allocations. A sketch of what I mean is below.
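To make that concrete, here's a rough C sketch of the attach-and-claim step (the segment name, sizes, and lack of error checking are illustrative only; this is not actual Open MPI code):

    /* Sketch only: one shared segment of P*M pages; process "rank" owns
     * pages [rank*M, (rank+1)*M) and first-touches exactly those, so the
     * OS places them local to that process. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the one shared segment (already created and sized elsewhere)
     * and first-touch only my M pages. */
    char *attach_and_claim(const char *shm_name, int rank, int nprocs, size_t M)
    {
        size_t page_sz = (size_t)sysconf(_SC_PAGESIZE);
        size_t total   = (size_t)nprocs * M * page_sz;

        int fd = shm_open(shm_name, O_RDWR, 0600);
        char *seg = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);

        /* Agreed-up-front policy: pages [rank*M, (rank+1)*M) are mine. */
        char *mine = seg + (size_t)rank * M * page_sz;
        for (size_t p = 0; p < M; ++p)
            mine[p * page_sz] = 0;     /* one write per page is enough */

        return mine;                   /* base of my local partition */
    }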

The total amount of shared memory will likely not go down, because the OS will still likely allocate on a per-page basis, right?

Total amount would go down significantly. Today, if you want to allocate 64 bytes on a page boundary, you allocate 64+pagesize, a 100x overhead. What I'm (evidently not so clearly) proposing is that we establish a policy about what memory will be local to whom. With that policy, we simply allocate our 64 bytes in the appropriate region. This eliminates the need for page alignment (the page is already in the right place, shared by many allocations, all of which want to be there). You could still want cacheline alignment... that's fine.
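Something like the following (again, just a sketch: my_base/my_size would come from the attach step above, and CACHE_LINE = 64 is an assumption) is all that's needed on the allocation side:

    /* Per-process bump allocator over my own pre-touched partition.
     * No lock is needed because no other process ever allocates from
     * these pages, and small allocations share pages instead of each
     * consuming a page-aligned chunk. */
    #include <stddef.h>

    #define CACHE_LINE 64

    static char  *my_base;   /* base of this process's partition  */
    static size_t my_size;   /* bytes in this process's partition */
    static size_t my_used;   /* private bump pointer              */

    void *local_shared_alloc(size_t nbytes)
    {
        size_t off = (my_used + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
        if (off + nbytes > my_size)
            return NULL;               /* partition exhausted */
        my_used = off + nbytes;
        return my_base + off;          /* cacheline-aligned, no page padding */
    }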

But per your 2nd point, would the resources required for each process to mmap/shmattach/whatever 511 other processes' shared memory spaces be prohibitive?

No need to have more shared memory segments. Just need a policy to say how your global space is partitioned.

Graham, Richard L. wrote:

I have not looked at the code in a long time, so not sure how many things have changed ... In general what you are suggesting is reasonable. However, especially on large machines you also need to worry about memory locality, so should allocate from memory pools that are appropriately located. I expect that memory allocated on a per-socket basis would do.

Is this what "maffinity" and "memory nodes" are about? If so, I would think memory locality should be handled there rather than in page alignment of individual 12-byte and 64-byte allocations.

maffinity was a first stab at memory affinity; it is currently (and has been for a long, long time) no-frills and didn't have a lot of thought put into it.

I see the "node id" and "bind" functions in there; I think Gleb must have added them somewhere along the way. I'm not sure how much thought was put into making those be truly generic functions (I see them implemented in libnuma, which AFAIK is Linux-specific). Does Solaris have memory affinity function calls?

Yes, I believe so, though perhaps I don't understand your question.

Things like mbind() and numa_setlocal_memory() are, I assume, Linux calls for placing some memory close to a process. I think the Solaris madvise() call does this: give a memory range and say something about how that memory should be placed -- e.g., the memory should be placed local to the next thread to touch that memory. Anyhow, I think the default policy is "first touch", so one could always do that.
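For example, I believe the advice call looks roughly like this on Solaris (flag name from memory, so treat it as approximate; MADV_ACCESS_LWP means "place these pages near the next thread/LWP that touches them"):

    /* Solaris-only sketch of the placement advice described above. */
    #include <sys/types.h>
    #include <sys/mman.h>

    int advise_next_toucher(void *addr, size_t len)
    {
        return madvise((caddr_t)addr, len, MADV_ACCESS_LWP);
    }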

I'm not an expert on this stuff, but I just wanted to reassure you that Solaris supports NUMA programming. There are interfaces for discovering the NUMA topology of a machine (there is a hierarchy of "locality groups", each containing CPUs and memory), for discovering in which locality group you are, for advising the VM system where you want memory placed, and for querying where certain memory is. I could do more homework on these matters if it'd be helpful.
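From memory, the discovery side uses liblgrp and looks roughly like this (link with -llgrp; exact headers and names are from memory, so treat the details as approximate):

    /* Sketch of querying the Solaris locality-group ("lgroup") topology. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/procset.h>     /* P_LWPID, P_MYID */
    #include <sys/lgrp_user.h>

    int main(void)
    {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
        if (cookie == LGRP_COOKIE_NONE)
            return 1;

        printf("lgroups visible to me: %d\n", lgrp_nlgrps(cookie));
        printf("my home lgroup:        %d\n", (int)lgrp_home(P_LWPID, P_MYID));

        lgrp_fini(cookie);
        return 0;
    }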
