Actually, I would be interested in this discussion.
On 9/5/08, Jeff Squyres <jsquy...@cisco.com> wrote:
> For the mailing list...
>
> Note that we moved this conversation to a higher bandwidth (telephone).
> If others are interested, please let us know.
>
> On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
>
>> Jeff Squyres wrote:
>>
>>> I think even first-touch will make *the whole page* be local to the
>>> process that touches it.
>>
>> Right.
>>
>>> So if you have each process take N bytes (where N << page_size), then
>>> the 0th process will make that whole page be local; it may be remote
>>> for others.
>>
>> I think I'm not making myself clear. Read on...
>>
>>>> *) You wouldn't need to control memory allocations with a lock
>>>> (except for multithreaded apps). I haven't looked at this too closely
>>>> yet, but the 3*n*n memory allocations in shared memory during
>>>> MPI_Init are currently serialized, which sounds disturbing when n is
>>>> 100 to 500 local processes.
>>>
>>> If I'm understanding your proposal right, you're saying that each
>>> process would create its own shared memory space, right? Then any
>>> other process that wants to send to that process would
>>> mmap/shmattach/whatever to the receiver's shared memory space. Right?
>>
>> I don't think it's necessary to have each process have its own segment.
>> The OS manages the shared area on a per-page basis anyhow. All that's
>> necessary is an agreement up front about which pages will be local to
>> which process. E.g., if there are P processes/processors and the shared
>> area has M pages per process, then there will be P*M pages altogether.
>> We'll say that the first M pages are local to process 0, the next M to
>> process 1, etc. That is, process 0 will first-touch the first M pages,
>> process 1 will first-touch the next M pages, etc. If an allocation
>> needs to be local to process i, then process i will allocate it from
>> its pages.
>> Since only process i can allocate from these pages, it does not need
>> any lock protection to keep other processes from allocating at the
>> same time. And, since these pages have the proper locality, small
>> allocations can all share common pages (instead of having a separate
>> page for each 12-byte or 64-byte allocation).
>>
>> Clearer? One shared memory region, partitioned equally among all
>> processes. Each process first-touches its own pages to get the right
>> locality. Each allocation is made by the process to whom it should be
>> local. Benefits include no multi-process locks and no need for page
>> alignment of tiny allocations.
>>
>>> The total amount of shared memory will likely not go down, because
>>> the OS will still likely allocate on a per-page basis, right?
>>
>> The total amount would go down significantly. Today, if you want to
>> allocate 64 bytes on a page boundary, you allocate 64+pagesize, a
>> roughly 100x overhead. What I'm (evidently not so clearly) proposing
>> is that we establish a policy about what memory will be local to whom.
>> With that policy, we simply allocate our 64 bytes in the appropriate
>> region. This eliminates the need for page alignment (the page is
>> already in the right place, shared by many allocations, all of which
>> want to be there). You might still want cacheline alignment... that's
>> fine.
>>
>>> But per your 2nd point, would the resources required for each process
>>> to mmap/shmattach/whatever 511 other processes' shared memory spaces
>>> be prohibitive?
>>
>> No need to have more shared memory segments. Just need a policy to say
>> how your global space is partitioned.
>>
>>>> Graham, Richard L. wrote:
>>>>
>>>>> I have not looked at the code in a long time, so I am not sure how
>>>>> many things have changed... In general, what you are suggesting is
>>>>> reasonable.
>>>>> However, especially on large machines, you also need to worry about
>>>>> memory locality, so you should allocate from memory pools that are
>>>>> appropriately located. I expect that memory allocated on a
>>>>> per-socket basis would do.
>>>>
>>>> Is this what "maffinity" and "memory nodes" are about? If so, I
>>>> would think memory locality should be handled there rather than in
>>>> page alignment of individual 12-byte and 64-byte allocations.
>>>
>>> maffinity was a first stab at memory affinity and is currently (and
>>> has been for a long, long time) no-frills; it didn't have a lot of
>>> thought put into it.
>>>
>>> I see the "node id" and "bind" functions in there; I think Gleb must
>>> have added them somewhere along the way. I'm not sure how much
>>> thought was put into making those truly generic functions (I see them
>>> implemented in libnuma, which AFAIK is Linux-specific). Does Solaris
>>> have memory affinity function calls?
>>
>> Yes, I believe so, though perhaps I don't understand your question.
>>
>> Things like mbind() and numa_setlocal_memory() are, I assume, Linux
>> calls for placing some memory close to a process. I think the Solaris
>> madvise() call does this: give a memory range and say something about
>> how that memory should be placed -- e.g., the memory should be placed
>> local to the next thread to touch it. Anyhow, I think the default
>> policy is "first touch", so one could always do that.
>>
>> I'm not an expert on this stuff, but I just wanted to reassure you
>> that Solaris supports NUMA programming. There are interfaces for
>> discovering the NUMA topology of a machine (there is a hierarchy of
>> "locality groups", each containing CPUs and memory), for discovering
>> in which locality group you are, for advising the VM system where you
>> want memory placed, and for querying where certain memory is. I could
>> do more homework on these matters if it'd be helpful.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> Cisco Systems