RFC: Fragmented sm Allocations
WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
buffers (CB) during MPI_Init().
Also:
Improve handling of error codes.
Automate the sizing of the mmap file.
WHY: To reduce consumption of shared memory, making job startup
more robust,
and possibly improving the scalability of startup.
WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
ompi_fifo_init(). This is where CBs are initialized one at a time, with
the components of each CB allocated individually. Changes can be seen in
ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
WHEN: Upon acceptance.
TIMEOUT: January 30, 2009.
________________________________
WHY (details)
The sm BTL establishes a FIFO for each non-self, on-node
connection. Each
FIFO is initialized during MPI_Init() with a circular buffer (CB).
(More CBs
can be added later in program execution if a FIFO runs out of room.)
A CB has different components that are used in different ways:
The "wrapper" is read by both sender and receiver, but is rarely
written.
The "queue" (FIFO entries) is accessed by both the sender and
receiver.
The "head" is accessed by the sender.
The "tail" is accessed by the receiver.
For performance reasons, a CB is not allocated as one large data
structure.
Rather, these components are laid out separately in memory and the
wrapper
has pointers to the various locations. Performance considerations
include:
false sharing: a component used by one process should not share a
cacheline
with another component that is modified by another process
NUMA: certain components should perhaps be mapped preferentially to
memory
pages that are close to the processes that access these components
Currently, the sm BTL handles these issues by allocating each
component of
each CB its own page. (Actually, it couples tails and queues
together.)
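To make this concrete, here is a simplified, illustrative sketch of the
shape of the data; it is not the actual ompi_cb_fifo_t /
ompi_cb_fifo_wrapper_t declarations, only an approximation of how the
components relate:

    /* Illustrative only -- not the real OMPI declarations.  The key point
     * is that the CB does not hold its components inline; it holds pointers
     * to separately allocated (currently page-aligned) regions. */
    struct cb_sketch {
        void **queue;   /* FIFO entries: written by sender, read by receiver */
        int   *head;    /* next slot to write: accessed by the sender        */
        int   *tail;    /* next slot to read:  accessed by the receiver      */
        /* ... bookkeeping read by both sides but rarely written ...         */
    };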
As the number of on-node processes grows, however, the shared-memory
allocation skyrockets. E.g., let's say there are n processes on-node.
There are therefore n(n-1) = O(n²) FIFOs, each with 3 allocations
(wrapper, head, and tail/queue). The shared-memory allocation for CBs
becomes 3n² pages. For large n, this dominates the shared-memory
consumption, even though most of the CB allocation is unused. E.g., a
12-byte "head" ends up consuming a full memory page! (At n = 1024 with
4K pages, 3n² pages already comes to roughly 13 Gbytes, almost all of
it unused.)
Not only is the 3n²-page allocation large, but it is also not tunable
via any MCA parameters.
Large shared-memory consumption has led to some number of start-up
and other
user problems. E.g., the e-mail thread at
http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
WHAT (details)
Several actions are recommended here.
1. Cacheline Rather than Pagesize Alignment
The first set of changes reduces alignment from pagesize to cacheline.
Though mapping to pages is motivated by NUMA locality, note:
The code already has NUMA locality optimizations (maffinity and mpools)
anyhow.
There is no data that I'm aware of substantiating the benefits of
locality optimizations in this context.
More to the point, I've tried some such experiments myself. I had two
processes communicating via shared memory on a large SMP that had a
large
difference between remote and local memory access times. I timed the
roundtrip latency for pingpongs between the processes. That time was
correlated to the relative separation between the two processes,
and not at
all to the placement of the physical memory backing the shared
variables. It
did not matter if the memory was local to the sender or receiver or
remote
from both! (E.g., colocal processes showed fast timings even if the
shared
memory were remote to both processes.)
My results do not prove that all NUMA platforms behave in the same
way. My
point is only that, though I understand the logic behind locality
optimizations for FIFO placement, the only data I am aware of does
not
substantiate that logic.
The changes are:
File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()
Use the alignment requested by the caller rather than additionally
imposing pagesize alignment.
File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init() and ompi_fifo_write_to_head()
Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Align the two calls to mpool_alloc on cacheline rather than page.
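As a rough illustration of the savings from this set of changes (using a
generic round-up helper for the arithmetic, not OMPI's own alignment
macros):

    #include <stddef.h>

    /* Round size up to a multiple of align (align must be a power of two).
     * Generic helper for illustration only. */
    static size_t align_up(size_t size, size_t align)
    {
        return (size + align - 1) & ~(align - 1);
    }

    /* With 4K pages, a 12-byte CB "head" occupies align_up(12, 4096) = 4096
     * bytes; with 128-byte cachelines it occupies align_up(12, 128) = 128
     * bytes -- a 32x reduction for that one component. */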
2. Aggregated Allocation
Another option is to lay out all the CBs at once and aggregate their
allocations.
This may have the added benefit of reducing lock contention during
MPI_Init(). On the one hand, the 3n² CB allocations during MPI_Init()
contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the
other hand, I know so far of no data showing that this lock contention
is impairing start-up scalability.
The objectives here would be to consolidate many CB components
together
subject to:
queues should be local to receivers
heads should be local to senders
tails should be local to receivers
wrappers should not share cachelines with heads or tails
In sum, for process myrank, the FIFO allocation in shared memory
during
MPI_Init() looks something like this:
ompi_fifo_t from 0 to myrank
ompi_fifo_t from 1 to myrank
ompi_fifo_t from 2 to myrank
...
ompi_fifo_t from n-1 to myrank
--- cacheline boundary ---
queue of pointers, for CB from 0 to myrank
queue of pointers, for CB from 1 to myrank
queue of pointers, for CB from 2 to myrank
...
queue of pointers, for CB from n-1 to myrank
--- cacheline boundary ---
head for CB from myrank to 0
tail for CB from 0 to myrank
head for CB from myrank to 1
tail for CB from 1 to myrank
head for CB from myrank to 2
tail for CB from 2 to myrank
...
head for CB from myrank to n-1
tail for CB from n-1 to myrank
--- cacheline boundary ---
wrapper, CB from 0 to myrank
wrapper, CB from 1 to myrank
wrapper, CB from 2 to myrank
...
wrapper, CB from n-1 to myrank
Note that out-bound heads and in-bound tails are interwoven. There
should be
no false sharing, however, even if multiple heads and tails share
the same
cacheline, since they're all accessed by the same process.
For multi-threaded processes, however, it is conceivable that
different
threads will be passing many messages to different peers. So, for
multi-threaded jobs, we spread heads and tails out onto their own
cachelines. Consuming the extra shared memory in the multi-threaded
case is
probably tolerable since the on-node-process count will be lower.
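A rough sketch of how the per-receiver aggregated region above might be
sized is shown here; the helper name and arguments are placeholders (the
actual work is proposed for compute_initial_cb_fifo_space() in btl_sm.c,
described below):

    #include <stddef.h>

    #define CACHELINE_SIZE 128   /* example value; would come from configuration */

    static size_t round_to_cacheline(size_t nbytes)
    {
        return (nbytes + CACHELINE_SIZE - 1) & ~(size_t)(CACHELINE_SIZE - 1);
    }

    /* Hypothetical sketch: total bytes of one process's aggregated region,
     * following the layout above.  n is the number of on-node processes;
     * the *_bytes arguments are the sizes of one ompi_fifo_t, one CB queue,
     * one head or tail, and one CB wrapper, respectively. */
    static size_t initial_cb_fifo_space(size_t n, size_t fifo_bytes,
                                        size_t queue_bytes,
                                        size_t headtail_bytes,
                                        size_t wrapper_bytes)
    {
        size_t space = 0;
        space += round_to_cacheline(n * fifo_bytes);         /* ompi_fifo_t, one per peer       */
        space += round_to_cacheline(n * queue_bytes);        /* in-bound CB queues              */
        space += round_to_cacheline(2 * n * headtail_bytes); /* out-bound heads, in-bound tails */
        space += round_to_cacheline(n * wrapper_bytes);      /* in-bound CB wrappers            */
        return space;
    }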
The changes are:
File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Instead of having ompi_cb_fifo_init() allocate each component of a CB
separately, change the interface to allow the caller to pass in an
array of addresses. These addresses are assumed to be already allocated
and are used for the CB components.
If the "array of addresses" is NULL, then we allocate the CB components
as before.
Here is pseudocode to illustrate this change. We replace

    fifo->head = mpool_alloc(...);      /* allocate head */
    if ( NULL == fifo->head )
        return OMPI_ERR_OUT_OF_RESOURCE;

with

    if ( NULL != cb_addr ) {
        fifo->head = cb_addr[1];        /* use supplied address, if available */
    } else {
        fifo->head = mpool_alloc(...);  /* allocate head */
        if ( NULL == fifo->head )
            return OMPI_ERR_OUT_OF_RESOURCE;
    }
File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init()
Change the interface to ompi_fifo_init() to allow the caller to
supply
addresses to use for CB components rather than having callees
allocate
memory for them.
Use such a supplied address, if available, when allocating the FIFO
(not CB)
head.
Function: ompi_fifo_write_to_head()
Change the call in ompi_fifo_write_to_head() to ompi_cb_fifo_init() to
reflect the new interface, passing NULL as the new argument.
File: ompi/mca/btl/sm/btl_sm.c
Function: compute_initial_cb_fifo_space() and
compute_initial_cb_fifo_addrs()
Add these two functions to compute how much space will be needed
for the
aggregated CB allocation and to compute addresses for individual CB
components for a particular sender and receiver.
Function: sm_btl_first_time_init()
Increase the allocation of FIFOs (call to mpool_calloc) to include
room for
the CBs.
Function: mca_btl_sm_add_procs()
Compute the addresses where CB components should be and pass them
into
ompi_fifo_init().
These changes impact only the allocation of CBs during MPI_Init().
If FIFOs
are grown later during program execution, they will continue to have
components allocated in a fragmented manner.
3. Free List Return Codes
This is unrelated to FIFOs, but is related to more robust handling of
shared-memory allocation.
The function sm_btl_first_time_init() should test the return values
when it
allocates free lists. It currently does not test return values,
proceeding
without a hiccup even if those allocations indicate an error. The
proposed
change is:
File: ompi/mca/btl/sm/btl_sm.c
Function: sm_btl_first_time_init()
Test the return codes from the calls to:
ompi_free_list_init_new()
ompi_free_list_init_new()
opal_free_list_init()
returning non-successful error status in case of error.
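A minimal sketch of the intended pattern follows; the init_* functions
are hypothetical stand-ins for the three calls listed above, not the
actual OMPI signatures:

    #define OMPI_SUCCESS 0

    /* Hypothetical stand-ins for the free-list initializations above. */
    static int init_eager_frag_list(void)   { return OMPI_SUCCESS; }
    static int init_max_frag_list(void)     { return OMPI_SUCCESS; }
    static int init_pending_send_list(void) { return OMPI_SUCCESS; }

    /* Proposed pattern for sm_btl_first_time_init(): check each return
     * code and propagate the first failure instead of continuing. */
    static int first_time_init_sketch(void)
    {
        int rc;

        if (OMPI_SUCCESS != (rc = init_eager_frag_list()))   return rc;
        if (OMPI_SUCCESS != (rc = init_max_frag_list()))     return rc;
        if (OMPI_SUCCESS != (rc = init_pending_send_list())) return rc;

        return OMPI_SUCCESS;
    }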
4. Better Automatic Sizing of mmap File
Currently, the size of the file to be mmaped is governed by three MCA
parameters:
mpool_sm_max_size
mpool_sm_min_size (default 128 Mbytes)
mpool_sm_per_peer_size (default 32 Mbytes)
Specifically, the file size is

    min(mpool_sm_max_size,
        max(mpool_sm_min_size,
            n * mpool_sm_per_peer_size))
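Expressed as code, the current sizing rule is roughly the following (the
variable names mirror the MCA parameters; this is a sketch, not the code
in mpool_sm):

    #include <stddef.h>

    /* Current rule: min(max_size, max(min_size, n * per_peer_size)).
     * (Under the proposal, a max of 0 would mean "no limit" and would
     * skip the upper clamp.) */
    static size_t current_mmap_file_size(size_t n,
                                         size_t mpool_sm_max_size,
                                         size_t mpool_sm_min_size,
                                         size_t mpool_sm_per_peer_size)
    {
        size_t size = n * mpool_sm_per_peer_size;
        if (size < mpool_sm_min_size) size = mpool_sm_min_size;
        if (size > mpool_sm_max_size) size = mpool_sm_max_size;
        return size;
    }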
This file size is a poor approximation for the actual amount of shared
memory needed by an application during MPI_Init(). E.g., at n=2, the
file is 128M even though less than 1M is needed. At large n, however,
the file is too small.
Instead, we should add code that produces a better estimate of how
much
shared memory will be needed during MPI_Init().
Regarding the MCA parameters:
mpool_sm_max_size: default should be 0 (no limit)
mpool_sm_min_size: default should be 0 (no limit)
mpool_sm_per_peer_size: should be eliminated
More accurate sizing could help reduce the problems users see
starting sm
jobs with large on-node-process counts.
One problem is that the size of the shared file is set by mpool_sm,
but
information about how much shared memory needs to be allocated during
MPI_Init() is in btl_sm. Since OMPI doesn't easily allow components
to call
one another, we're stuck.
Supporting Data
Memory Consumption
Memory consumption can be measured or modeled. (I have a byte-
accurate
model.)
Here are some comparisons for the case of:
1024 on-node processes
8-byte pointers (LP64 execution model)
4K eager limit
32K chunk size
128-byte cacheline size
4K (Linux) or 8K (Solaris) page size
Here are breakdowns of the shared-memory allocations during MPI_Init(),
in units of 10⁶ bytes:

                         pagesize alignment     cacheline
    description          8K pages   4K pages    alignment
    ===============      ========   ========    =========
    CB wrappers             8,682      4,391          235
    CB queues+tails         9,822      5,531        1,374
    CB heads                8,632      4,341          184
    eager freelists          9,171      9,032        8,898
    other                      370        362          355
    ---------------      --------   --------    ---------
    total                   36,677     23,658       11,046
That is, with pagesize alignment, the CB allocations consume over 3n²
pages and dominate, even though most of that space is unused.
The next biggest contributor is the eager freelists. There are 2n²
eager fragments, each 4K (the eager limit), costing approximately
8 Gbytes.
With cacheline alignment:
Overall consumption drops by over 2× compared to 4K pages and over 3×
compared to 8K pages.
The remaining leading category (eager freelists) can now be scaled
down by
adjusting an existing MCA parameter btl_sm_eager_limit.
For that matter, the second leading category (CB queues) can also
be scaled
down by adjusting an existing MCA parameter btl_sm_size_of_cb_queue.
Here are results when we not only drop from pagesize to cacheline
alignment, but also aggregate CB allocations:

    10⁶ bytes    description
    =========    ===============
        1,250    FIFOs and CBs
        8,899    eager freelists
          270    max freelists
    ---------    ---------------
       10,418    total
With the pagesize dependence gone and only a small cacheline dependence
remaining, one could really start to shoehorn big jobs into a small
shared-memory area. E.g., consider bumping the eager limit down to 256
bytes, the size of a CB queue to 16 entries, and the chunk size to 8K.
Then, shared-memory consumption for 1024 processes looks like this:

    10⁶ bytes    description
    =========    ===============
          311    FIFOs and CBs
          544    eager freelists
           68    max freelists
    ---------    ---------------
          924    total
Ping-Pong Latency
We can also look at performance. Here are OSU latency results for
short
messages on a Sun v20z. The absolute numbers are less important
than the
relative difference between the two sets:
    bytes    before    after
        0      0.85     0.84    µsec
        1      0.97     0.99
        2      0.97     0.98
        4      0.97     0.98
        8      0.97     0.99
There is a penalty for non-null messages due to OMPI "data
convertors".
Importantly, to within the reproducibility of the measurements, it is
unclear if there is any slowdown that one can attribute to the
changes.
(Results are the median of 5 measurements. The values look smooth,
but the
error bars, which are difficult to characterize, are probably
greater than
the 0.01-0.02 µsec differences seen here.)
Other Considerations
Simply going from pagesize alignment to cacheline alignment should
be a
relatively unintrusive code change and effect most of the reduction
in
shared-memory allocation.
Also aggregating allocations is more intrusive, but has a few more
advantages, including:
Slight further reduction in the amount of shared memory allocated.
Less lock contention as the number of allocation is reduced from
O(n2) to
O(n), possibly improving start-up performance.
Can provide memory locality even on systems that don't have maffinity
support.
It would be nice to automatically size the mmap file better than is
done today, but (as noted) I haven't yet figured out how to make the
btl_sm and mpool_sm components talk to each other.
My proposed code changes need more testing, especially in the case of
multiple memory nodes per node.
It remains unclear to me if error codes are being treated properly
in the
mca_btl_sm_add_procs() code. E.g., if one process is unable to
allocate
memory in the shared area, should all processes fail?