For the mailing list...
Note that we moved this conversation to a higher-bandwidth medium
(the telephone). If others are interested, please let us know.
On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
Jeff Squyres wrote:
I think even first-touch will make *the whole page* be local to
the process that touches it.
Right.
So if you have each process take N bytes (where N << page_size),
then the 0th process will make that whole page be local; it may be
remote for others.
I think I'm not making myself clear. Read on...
*) You wouldn't need to control memory allocations with a lock
(except for multithreaded apps). I haven't looked at this too
closely yet, but the 3*n*n memory allocations in shared memory
during MPI_Init are currently serialized, which sounds disturbing
when n is 100 to 500 local processes.
If I'm understanding your proposal right, you're saying that each
process would create its own shared memory space, right? Then any
other process that wants to send to that process would
mmap/shmattach/whatever to the receiver's shared memory space. Right?
I don't think it's necessary to have each process have its own
segment. The OS manages the shared area on a per-page basis
anyhow. All that's necessary is that there is an agreement up front
about which pages will be local to which process. E.g., if there
are P processes/processors and the shared area has M pages per
process, then there will be P*M pages altogether. We'll say that
the first M pages are local to process 0, the next M to process 1,
etc. That is, process 0 will first-touch the first M pages, process
1 will first-touch the next M pages, etc. If an allocation needs to
be local to process i, then process i will allocate it from its
pages. Since only process i can allocate from these pages, it does
not need any lock protection to keep other processes from allocating
at the same time. And, since these pages have the proper locality,
then small allocations can all share common pages (instead of having
a separate page for each 12-byte or 64-byte allocation).
Clearer? One shared memory region, partitioned equally among all
processes. Each process first-touches its own pages to get the
right locality. Each allocation is made by the process to which it
should be local. Benefits include no multi-process locks and no
need for page alignment of tiny allocations.
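If it helps, here is a bare-bones sketch of what I have in mind,
assuming a single backing file that one process has already created
and ftruncated to P*M pages; names like shm_partition_init and
shm_partition_alloc are just placeholders for illustration, not
anything in the tree:

/* Sketch: one shared region of P*M pages, with pages
 * [rank*M, (rank+1)*M) reserved for process `rank`. */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

typedef struct {
    char  *base;   /* start of this process's partition           */
    size_t size;   /* M * page_size bytes                         */
    size_t used;   /* bump pointer; only this process advances it */
} shm_partition_t;

/* Map the common backing file and claim this rank's slice of it. */
static int shm_partition_init(shm_partition_t *p, const char *file,
                              int rank, int nprocs, size_t pages_per_proc)
{
    size_t pgsz  = (size_t)sysconf(_SC_PAGESIZE);
    size_t slice = pages_per_proc * pgsz;
    int fd = open(file, O_RDWR);
    if (fd < 0) return -1;

    char *region = mmap(NULL, slice * (size_t)nprocs,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (region == MAP_FAILED) return -1;

    p->base = region + (size_t)rank * slice;
    p->size = slice;
    p->used = 0;

    /* First-touch our own pages so the OS places them local to us. */
    memset(p->base, 0, p->size);
    return 0;
}

/* No lock needed: only the owning process ever allocates from its
 * partition, so a bump pointer with cacheline alignment suffices.  */
static void *shm_partition_alloc(shm_partition_t *p, size_t len)
{
    size_t aligned = (p->used + 63) & ~(size_t)63;  /* 64-byte cachelines */
    if (aligned + len > p->size) return NULL;
    p->used = aligned + len;
    return p->base + aligned;
}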
The total amount of shared memory will likely not go down, because
the OS will still likely allocate on a per-page basis, right?
Total amount would go down significantly. Today, if you want to
allocate 64 bytes on a page boundary, you allocate 64+pagesize, a
100x overhead. What I'm (evidently not so clearly) proposing
is that we establish a policy about what memory will be local to
whom. With that policy, we simply allocate our 64 bytes in the
appropriate region. This eliminates the need for page alignment
(the page is already in the right place, shared by many allocations,
all of which want to be there). You could still want cacheline
alignment... that's fine.
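To put rough numbers on it (taking a 4 KB page for concreteness,
since the exact page size varies): padding a 64-byte allocation out
to a page boundary consumes 64 + 4096 bytes, about 65x the payload,
and closer to 130x with the 8 KB pages common on SPARC. Under the
partitioned scheme, 64 such allocations can share a single 4 KB page,
rounded up only to cacheline boundaries.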
But per your 2nd point, would the resources required for each
process to mmap/shmattach/whatever 511 other processes' shared
memory spaces be prohibitive?
No need to have more shared memory segments. Just need a policy to
say how your global space is partitioned.
Graham, Richard L. wrote:
I have not looked at the code in a long time, so not sure how
many things have changed ... In general what you are suggesting
is reasonable. However, especially on large machines you also
need to worry about memory locality, so should allocate from
memory pools that are appropriately located. I expect that
memory allocated on a per-socket basis would do.
Is this what "maffinity" and "memory nodes" are about? If so, I
would think memory locality should be handled there rather than
in page alignment of individual 12-byte and 64-byte allocations.
maffinity was a first stab at memory affinity; it is currently (and
has been for a long, long time) no-frills and didn't have a lot of
thought put into it.
I see the "node id" and "bind" functions in there; I think Gleb
must have added them somewhere along the way. I'm not sure how
much thought was put into making those be truly generic functions
(I see them implemented in libnuma, which AFAIK is Linux-
specific). Does Solaris have memory affinity function calls?
Yes, I believe so, though perhaps I don't understand your question.
Things like mbind() and numa_setlocal_memory() are, I assume, Linux
calls for placing some memory close to a process. I think the
Solaris madvise() call does this: give a memory range and say
something about how that memory should be placed -- e.g., the memory
should be placed local to the next thread to touch that memory.
Anyhow, I think the default policy is "first touch", so one could
always do that.
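For instance, a minimal sketch of the Solaris-style advice I mean
(MADV_ACCESS_LWP is the Solaris-specific flag, so I'm guarding it
with an #ifdef; other systems won't have it):

/* Sketch: ask the kernel to place a range local to the next LWP that
 * touches it.  MADV_ACCESS_LWP is Solaris-specific advice; on Linux
 * one would reach for mbind() or numa_setlocal_memory() from libnuma
 * instead. */
#include <sys/types.h>
#include <sys/mman.h>

static int place_near_next_toucher(void *addr, size_t len)
{
#ifdef MADV_ACCESS_LWP
    /* "The next LWP to touch this range will access it heavily." */
    return madvise((caddr_t)addr, len, MADV_ACCESS_LWP);
#else
    /* No such advice available: fall back to the default first-touch
     * policy by simply letting the owning process touch its pages.  */
    (void)addr; (void)len;
    return 0;
#endif
}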
I'm not an expert on this stuff, but I just wanted to reassure you
that Solaris supports NUMA programming. There are interfaces for
discovering the NUMA topology of a machine (there is a hierarchy of
"locality groups", each containing CPUs and memory), for discovering
in which locality group you are, for advising the VM system where
you want memory placed, and for querying where certain memory is. I
could do more homework on these matters if it'd be helpful.
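As a flavor of what's there, a bare-bones use of the locality-group
interfaces (liblgrp, so link with -llgrp) would look something like
the following; I'm writing this from memory, so treat it as a sketch
rather than gospel:

/* Sketch of the Solaris locality-group ("lgrp") discovery calls. */
#include <stdio.h>
#include <sys/lgrp_user.h>
#include <sys/procset.h>     /* P_LWPID, P_MYID */

int main(void)
{
    /* Snapshot of the lgroup hierarchy as visible to this process. */
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
    if (cookie == LGRP_COOKIE_NONE) return 1;

    /* Which locality group is "home" for the calling LWP?          */
    lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
    printf("home lgroup: %d (root: %d)\n",
           (int)home, (int)lgrp_root(cookie));

    lgrp_fini(cookie);
    return 0;
}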
--
Jeff Squyres
Cisco Systems