Actually, I would be interested in this discussion.

On 9/5/08, Jeff Squyres <jsquy...@cisco.com> wrote:
>
> For the mailing list...
>
> Note that we moved this conversation to a higher-bandwidth medium (the telephone).  If
> others are interested, please let us know.
>
>
> On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
>
>> Jeff Squyres wrote:
>>
>>> I think even first-touch will make *the whole page* be local to the
>>>  process that touches it.
>>>
>>
>> Right.
>>
>>> So if you have each process take N bytes (where N << page_size), then
>>> the 0th process will make that whole page be local; it may be remote for
>>> others.
>>>
>>
>> I think I'm not making myself clear.  Read on...
>>
>>>> *) You wouldn't need to control memory allocations with a lock (except
>>>> for multithreaded apps).  I haven't looked at this too closely yet, but the
>>>> 3*n*n memory allocations in shared memory during MPI_Init are currently
>>>> serialized, which sounds disturbing when n is 100 to 500 local processes.
>>>>
>>>
>>> If I'm understanding your proposal right, you're saying that each
>>> process would create its own shared memory space, right?  Then any other
>>> process that wants to send to that process would mmap/shmattach/whatever to
>>> the receiver's shared memory space.  Right?
>>>
>>
>> I don't think it's necessary to have each process have its own segment.
>>  The OS manages the shared area on a per-page basis anyhow.  All that's
>> necessary is that there is an agreement up front about which pages will be
>> local to which process.  E.g., if there are P processes/processors and the
>> shared area has M pages per process, then there will be P*M pages
>> altogether.  We'll say that the first M pages are local to process 0, the
>> next M to process 1, etc.  That is, process 0 will first-touch the first M
>> pages, process 1 will first-touch the next M pages, etc.  If an allocation
>> needs to be local to process i, then process i will allocate it from its
>> pages.  Since only process i can allocate from these pages, it does not need
>> any lock protection to keep other processes from allocating at the same
>> time.  And, since these pages have the proper locality, small
>> allocations can all share common pages (instead of having a separate page
>> for each 12-byte or 64-byte allocation).
>>
>> Clearer?  One shared memory region, partitioned equally among all
>> processes.  Each process first-touches its own pages to get the right
>> locality.  Each allocation is made by the process to whom it should be local.
>>  Benefits include no multi-process locks and no need for page alignment of
>> tiny allocations.
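>>
>> To make the layout concrete, here is a rough, untested sketch (the function
>> and variable names are invented just for illustration; assume "fd" is the
>> backing file for the one common shared region, "rank" is the local rank,
>> and "nprocs" the number of local processes):
>>
>>   #include <stddef.h>
>>   #include <string.h>
>>   #include <sys/mman.h>
>>
>>   /* One shared region of nprocs * pages_per_proc pages.  Process i
>>    * first-touches (and later allocates from) only the i-th slice. */
>>   static char *setup_my_slice(int fd, int rank, int nprocs,
>>                               size_t pages_per_proc, size_t page_size)
>>   {
>>       size_t region_size = (size_t) nprocs * pages_per_proc * page_size;
>>       char *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
>>                           MAP_SHARED, fd, 0);
>>       if (region == MAP_FAILED)
>>           return NULL;
>>
>>       /* First-touch only my own pages so they end up local to me. */
>>       char *mine = region + (size_t) rank * pages_per_proc * page_size;
>>       memset(mine, 0, pages_per_proc * page_size);
>>       return mine;
>>   }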
>>
>>> The total amount of shared memory will likely not go down, because the
>>> OS will still likely allocate on a per-page basis, right?
>>>
>>
>> Total amount would go down significantly.  Today, if you want to allocate
>> 64 bytes on a page boundary, you allocate 64+pagesize, a 100x overhead.
>> What I'm (evidently not so clearly) proposing is that we establish a
>> policy about which memory will be local to whom.  With that policy, we simply
>> allocate our 64 bytes in the appropriate region.  This eliminates the need
>> for page alignment (the page is already in the right place, shared by many
>> allocations all of which want to be there).  You could still want cacheline
>> alignment... that's fine.
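>>
>> So a 64-byte allocation would just be bumped out of my own slice with
>> cacheline alignment.  Again, only a sketch; the struct and names below are
>> invented:
>>
>>   #include <stddef.h>
>>
>>   #define CACHELINE 64   /* typical line size; adjust per platform */
>>
>>   /* Per-process bump allocator over my slice of the shared region.
>>    * No lock: only I allocate here; others just map and use the results. */
>>   struct slice_alloc {
>>       char   *base;   /* start of my pages */
>>       size_t  size;   /* bytes in my slice */
>>       size_t  used;   /* bytes handed out  */
>>   };
>>
>>   static void *slice_malloc(struct slice_alloc *a, size_t nbytes)
>>   {
>>       /* Round up to a cacheline boundary; no page alignment needed,
>>        * since the whole slice already has the right locality. */
>>       size_t off = (a->used + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
>>       if (off + nbytes > a->size)
>>           return NULL;                      /* slice exhausted */
>>       a->used = off + nbytes;
>>       return a->base + off;
>>   }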
>>
>>> But per your 2nd point, would the resources required for each process to
>>> mmap/shmattach/whatever 511 other processes' shared memory spaces be
>>> prohibitive?
>>>
>>
>> No need to have more shared memory segments.  Just need a policy to say
>> how your global space is partitioned.
>>
>>>> Graham, Richard L. wrote:
>>>>
>>>>> I have not looked at the code in a long time, so I'm not sure how many
>>>>> things have changed ...  In general what you are suggesting is reasonable.
>>>>> However, especially on large machines you also need to worry about memory
>>>>> locality, so you should allocate from memory pools that are appropriately
>>>>> located.  I expect that memory allocated on a per-socket basis would do.
>>>>>
>>>>
>>>> Is this what "maffinity" and "memory nodes" are about?  If so, I would
>>>> think memory locality should be handled there rather than in page alignment
>>>> of individual 12-byte and 64-byte allocations.
>>>>
>>>
>>> maffinity was a first stab at memory affinity; it is currently (and has
>>> been for a long, long time) no frills, and it didn't have a lot of thought
>>> put into it.
>>>
>>> I see the "node id" and "bind" functions in there; I think Gleb must
>>> have added them somewhere along the way.  I'm not sure how much thought
>>> was put into making those truly generic functions (I see them
>>> implemented in libnuma, which AFAIK is Linux-specific).  Does Solaris have
>>> memory affinity function calls?
>>>
>>
>> Yes, I believe so, though perhaps I don't understand your question.
>>
>> Things like mbind() and numa_setlocal_memory() are, I assume, Linux calls
>> for placing some memory close to a process.  I think the Solaris madvise()
>> call does this:  give a memory range and say something about how that memory
>> should be placed -- e.g., the memory should be placed local to the next
>> thread to touch that memory.  Anyhow, I think the default policy is "first
>> touch", so one could always do that.
>>
>> I'm not an expert on this stuff, but I just wanted to reassure you that
>> Solaris supports NUMA programming.  There are interfaces for discovering the
>> NUMA topology of a machine (there is a hierarchy of "locality groups", each
>> containing CPUs and memory), for discovering in which locality group you
>> are, for advising the VM system where you want memory placed, and for
>> querying where certain memory is.  I could do more homework on these matters
>> if it'd be helpful.
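>>
>> E.g., something roughly like this (again from memory; see the
>> lgrp_init(3LGRP) family of man pages for the real details):
>>
>>   #include <sys/lgrp_user.h>  /* Solaris locality groups; link with -llgrp */
>>   #include <sys/procset.h>
>>   #include <stdio.h>
>>
>>   int main(void)
>>   {
>>       /* Take a snapshot of the lgroup hierarchy as seen by this caller. */
>>       lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
>>       if (cookie == LGRP_COOKIE_NONE)
>>           return 1;
>>
>>       printf("lgroups visible to me: %d\n", lgrp_nlgrps(cookie));
>>       printf("my home lgroup: %d\n", (int) lgrp_home(P_LWPID, P_MYID));
>>
>>       lgrp_fini(cookie);
>>       return 0;
>>   }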
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
