Great analysis and suggested changes!  I've not had a chance yet
to look at your hg branch, so this isn't a code review...  Barring a
bad code review, I'd say these changes should all go in the trunk
for inclusion in 1.4.

2009/1/14 Eugene Loh <eugene....@sun.com>:
>
>
> RFC: Fragmented sm Allocations
>
> WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
> buffers (CB) during MPI_Init().
>
> Also:
>
> Improve handling of error codes.
> Automate the sizing of the mmap file.
>
> WHY: To reduce the consumption of shared memory, making job startup more
> robust and possibly improving the scalability of startup.
>
> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
> ompi_fifo_init(). This is where CBs are initialized one at a time, with the
> components of each CB allocated individually. Changes can be seen in
> ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>
> WHEN: Upon acceptance.
>
> TIMEOUT: January 30, 2009.
>
> ________________________________
>
> WHY (details)
>
> The sm BTL establishes a FIFO for each non-self, on-node connection. Each
> FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs
> can be added later in program execution if a FIFO runs out of room.)
>
> A CB has different components that are used in different ways:
>
> The "wrapper" is read by both sender and receiver, but is rarely written.
> The "queue" (FIFO entries) is accessed by both the sender and receiver.
> The "head" is accessed by the sender.
> The "tail" is accessed by the receiver.
>
> For performance reasons, a CB is not allocated as one large data structure.
> Rather, these components are laid out separately in memory and the wrapper
> has pointers to the various locations. Performance considerations include:
>
> false sharing: a component used by one process should not share a cacheline
> with another component that is modified by another process
> NUMA: certain components should perhaps be mapped preferentially to memory
> pages that are close to the processes that access these components
>
> Currently, the sm BTL handles these issues by allocating each component of
> each CB its own page. (Actually, it couples tails and queues together.)
>
> As the number of on-node processes grows, however, the shared-memory
> allocation skyrockets. E.g., let's say there are n processes on-node. There
> are therefore n(n-1) = O(n²) FIFOs, each with 3 allocations (wrapper, head,
> and tail/queue). The shared-memory allocation for CBs becomes 3n² pages. For
> large n, this dominates the shared-memory consumption, even though most of
> the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full
> memory page!
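>
> As a sanity check on that growth (a minimal, standalone sketch; only the 3
> allocations per CB, the page size, and n come from the text above, the rest
> is plain arithmetic):
>
>      /* cb_pages.c: rough page cost of the fragmented CB allocation */
>      #include <stdio.h>
>
>      int main(void) {
>          long n = 1024;                /* on-node processes */
>          long fifos = n * (n - 1);     /* one FIFO per ordered pair */
>          long allocs = 3 * fifos;      /* wrapper, head, tail/queue */
>          long pagesize = 4096;         /* 4K (Linux) pages */
>          printf("%ld separate allocations, about %.1f Gbytes of pages\n",
>                 allocs, allocs * (double)pagesize / 1e9);
>          return 0;
>      }
>
> For n=1024 and 4K pages this is already roughly 13 Gbytes of page-granular
> allocations, in the same ballpark as the CB rows of the tables in the
> "Supporting Data" section below.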
>
> Not only is the 3n²-page allocation large, but it is also not tunable via
> any MCA parameters.
>
> Large shared-memory consumption has led to a number of start-up and other
> user problems. E.g., the e-mail thread at
> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
>
> WHAT (details)
>
> Several actions are recommended here.
>
> 1. Cacheline Rather than Pagesize Alignment
>
> The first set of changes reduces alignment from pagesize to cacheline. Though
> mapping to pages is motivated by NUMA locality, note:
>
> The code already has NUMA locality optimizations (maffinity and mpools)
> anyhow.
> There is no data that I'm aware of substantiating the benefits of locality
> optimizations in this context.
>
> More to the point, I've tried some such experiments myself. I had two
> processes communicating via shared memory on a large SMP that had a large
> difference between remote and local memory access times. I timed the
> roundtrip latency for pingpongs between the processes. That time was
> correlated to the relative separation between the two processes, and not at
> all to the placement of the physical memory backing the shared variables. It
> did not matter if the memory was local to the sender or receiver or remote
> from both! (E.g., colocal processes showed fast timings even if the shared
> memory were remote to both processes.)
>
> My results do not prove that all NUMA platforms behave in the same way. My
> point is only that, though I understand the logic behind locality
> optimizations for FIFO placement, the only data I am aware of does not
> substantiate that logic.
>
> The changes are:
>
> File: ompi/mca/mpool/sm/mpool_sm_module.c
> Function: mca_mpool_sm_alloc()
>
> Use the alignment requested by the caller rather than imposing pagesize
> alignment on top of it.
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
>
> Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
> Align the two calls to mpool_alloc on cacheline rather than page.
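>
> Illustrative pseudocode only (the exact argument lists in the hg branch may
> differ, and CACHE_LINE_SIZE stands for whatever cacheline-size constant the
> tree uses); the idea in all three places is simply to request cacheline
> rather than pagesize alignment from the allocator:
>
>      /* before: each CB component is padded out to a full page */
>      fifo->head = mpool_alloc(..., getpagesize() /* alignment */, ...);
>
>      /* after: cacheline alignment is enough to avoid false sharing */
>      fifo->head = mpool_alloc(..., CACHE_LINE_SIZE /* alignment */, ...);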
>
> 2. Aggregated Allocation
>
> Another option is to lay out all the CBs at once and aggregate their
> allocations.
>
> This may have the added benefit of reducing lock contention during
> MPI_Init(). On the one hand, the 3n² CB allocations during MPI_Init()
> contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the
> other hand, I know so far of no data showing that this lock contention is
> impairing start-up scalability.
>
> The objective here would be to consolidate the many CB components,
> subject to:
>
> queues should be local to receivers
> heads should be local to senders
> tails should be local to receivers
> wrappers should not share cachelines with heads or tails
>
> In sum, for process myrank, the FIFO allocation in shared memory during
> MPI_Init() looks something like this:
>
>     ompi_fifo_t   from  0  to myrank
>     ompi_fifo_t   from  1  to myrank
>     ompi_fifo_t   from  2  to myrank
>     ...
>     ompi_fifo_t   from n-1 to myrank
>     --- cacheline boundary ---
>     queue of pointers, for CB from  0  to myrank
>     queue of pointers, for CB from  1  to myrank
>     queue of pointers, for CB from  2  to myrank
>     ...
>     queue of pointers, for CB from n-1 to myrank
>     --- cacheline boundary ---
>     head for CB from myrank to  0
>     tail for CB from  0  to myrank
>     head for CB from myrank to  1
>     tail for CB from  1  to myrank
>     head for CB from myrank to  2
>     tail for CB from  2  to myrank
>     ...
>     head for CB from myrank to n-1
>     tail for CB from n-1 to myrank
>     --- cacheline boundary ---
>     wrapper, CB from  0  to myrank
>     wrapper, CB from  1  to myrank
>     wrapper, CB from  2  to myrank
>     ...
>     wrapper, CB from n-1 to myrank
>
> Note that out-bound heads and in-bound tails are interwoven. There should be
> no false sharing, however, even if multiple heads and tails share the same
> cacheline, since they're all accessed by the same process.
>
> For multi-threaded processes, however, it is conceivable that different
> threads will be passing many messages to different peers. So, for
> multi-threaded jobs, we spread heads and tails out onto their own
> cachelines. Consuming the extra shared memory in the multi-threaded case is
> probably tolerable since the on-node-process count will be lower.
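>
> To make the bookkeeping concrete, here is a minimal, standalone sketch of how
> one receiver's aggregated region might be sized and carved into
> cacheline-aligned sections, in the spirit of the compute_initial_cb_fifo_space()
> helper described further below. The structure sizes and names here are
> placeholders, not the code in the hg branch:
>
>      /* cb_layout.c: hypothetical sizing of one receiver's aggregated
>       * region, following the layout sketched above */
>      #include <stdio.h>
>      #include <stddef.h>
>
>      #define CACHELINE 128           /* assumed cacheline size */
>
>      /* round x up to the next cacheline boundary */
>      static size_t roundup(size_t x) {
>          return (x + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
>      }
>
>      /* placeholder sizes; the real ones come from the OMPI structs */
>      static size_t region_size(size_t n, size_t fifo_t, size_t queue_bytes,
>                                size_t ctl_t, size_t wrapper_t, int threaded) {
>          size_t sz = 0;
>          sz += roundup(n * fifo_t);          /* n ompi_fifo_t entries    */
>          sz += roundup(n * queue_bytes);     /* n queues of pointers     */
>          if (threaded)                       /* heads/tails spread out   */
>              sz += 2 * n * roundup(ctl_t);   /* one cacheline per ctl    */
>          else                                /* heads/tails interwoven   */
>              sz += roundup(2 * n * ctl_t);
>          sz += roundup(n * wrapper_t);       /* n CB wrappers            */
>          return sz;
>      }
>
>      int main(void) {
>          size_t n = 1024;
>          /* 64-byte fifo, 128 8-byte queue entries, 12-byte head/tail,
>           * 64-byte wrapper -- all guesses for illustration */
>          printf("single-threaded: %zu bytes per receiver\n",
>                 region_size(n, 64, 128 * 8, 12, 64, 0));
>          printf("multi-threaded:  %zu bytes per receiver\n",
>                 region_size(n, 64, 128 * 8, 12, 64, 1));
>          return 0;
>      }
>
> With these guessed sizes, n such regions come to roughly 1.2 Gbytes in total,
> in the same neighborhood as the "FIFOs and CBs" line of the aggregated table
> in the "Supporting Data" section.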
>
> The changes are:
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
> Instead of having ompi_cb_fifo_init() allocate each component of a CB
> separately, change the interface to allow the caller to pass in an array of
> addresses. These addresses are assumed to be already allocated and are to be
> used for the CB components.
>
> If the "array of addresses" is NULL, then we allocate the CB components as
> before.
>
> Here is pseudocode to illustrate this change. We replace
>
>      fifo->head = mpool_alloc(...);   /* allocate head */
>      if ( NULL == fifo->head )
>          return OMPI_ERR_OUT_OF_RESOURCE;
>
>
> with
>
>      if ( NULL != cb_addr ) {
>          fifo->head = cb_addr[1];         /* use supplied address, if available */
>      } else {
>          fifo->head = mpool_alloc(...);   /* allocate head */
>          if ( NULL == fifo->head )
>              return OMPI_ERR_OUT_OF_RESOURCE;
>      }
>
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init()
>
> Change the interface to ompi_fifo_init() to allow the caller to supply
> addresses to use for CB components rather than having callees allocate
> memory for them.
>
> Use such a supplied address, if available, when allocating the FIFO (not CB)
> head.
>
> Function: ompi_fifo_write_to_head()
>
> Change the call in ompi_fifo_write_to_head() to ompi_cb_fifo_init() to
> reflect the new interface, passing NULL as the new argument.
>
> File: ompi/mca/btl/sm/btl_sm.c
> Function: compute_initial_cb_fifo_space() and
> compute_initial_cb_fifo_addrs()
>
> Add these two functions to compute how much space will be needed for the
> aggregated CB allocation and to compute addresses for individual CB
> components for a particular sender and receiver.
>
> Function: sm_btl_first_time_init()
>
> Increase the allocation of FIFOs (call to mpool_calloc) to include room for
> the CBs.
>
> Function: mca_btl_sm_add_procs()
>
> Compute the addresses where CB components should be and pass them into
> ompi_fifo_init().
>
> These changes impact only the allocation of CBs during MPI_Init(). If FIFOs
> are grown later during program execution, they will continue to have
> components allocated in a fragmented manner.
>
> 3. Free List Return Codes
>
> This is unrelated to FIFOs, but is related to more robust handling of
> shared-memory allocation.
>
> The function sm_btl_first_time_init() should test the return values when it
> allocates free lists. It currently does not test return values, proceeding
> without a hiccup even if those allocations indicate an error. The proposed
> change is:
>
> File: ompi/mca/btl/sm/btl_sm.c
> Function: sm_btl_first_time_init()
>
> Test the return codes from the calls to:
>
>      ompi_free_list_init_new()
>      ompi_free_list_init_new()
>      opal_free_list_init()
>
>
> returning a non-successful status in case of error.
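>
> In the same pseudocode style as above (the actual argument lists are unchanged
> and omitted here), each call would become something like:
>
>      rc = ompi_free_list_init_new( ... );   /* e.g., the eager frag list */
>      if ( OMPI_SUCCESS != rc ) {
>          return rc;
>      }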
>
> 4. Better Automatic Sizing of mmap File
>
> Currently, the size of the file to be mmaped is governed by three MCA
> parameters:
>
> mpool_sm_max_size
> mpool_sm_min_size (default 128 Mbytes)
> mpool_sm_per_peer_size (default 32 Mbytes)
>
> Specifically, the file size is
>
>      min(mpool_sm_max_size,
>          max(mpool_sm_min_size,
>              n * mpool_sm_per_peer_size))
>
> This file size is a poor approximation for the actual amount of shared
> memory needed by an application during MPI_Init(). E.g., at n=2, the file is
> 128M even though less than 1M is needed. At large n, however, the file can
> end up too small for what MPI_Init() actually needs.
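>
> A quick evaluation of that formula (a standalone sketch; the defaults for the
> min and per-peer sizes are those quoted above, and ignoring the max-size cap
> is my simplification):
>
>      /* sm_filesize.c: the current mmap sizing heuristic, roughly */
>      #include <stdio.h>
>
>      int main(void) {
>          double min_size = 128e6;    /* mpool_sm_min_size default      */
>          double per_peer =  32e6;    /* mpool_sm_per_peer_size default */
>          int n;
>          for (n = 2; n <= 1024; n *= 4) {
>              double size = n * per_peer;             /* n * per_peer_size  */
>              if (size < min_size) size = min_size;   /* max(min_size, ...) */
>              printf("n = %4d  ->  file of %6.0f Mbytes\n", n, size / 1e6);
>          }
>          return 0;
>      }
>
> At n=2 this reserves 128 Mbytes against the less-than-1-Mbyte actual need,
> while at large n the linear n * per_peer term falls behind the roughly
> quadratic growth of the actual requirement shown in the "Supporting Data"
> section below.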
>
> Instead, we should add code that produces a better estimate of how much
> shared memory will be needed during MPI_Init().
>
> Regarding the MCA parameters:
>
> mpool_sm_max_size: default should be 0 (no limit)
> mpool_sm_min_size: default should be 0 (no limit)
> mpool_sm_per_peer_size: should be eliminated
>
> More accurate sizing could help reduce the problems users see starting sm
> jobs with large on-node-process counts.
>
> One problem is that the size of the shared file is set by mpool_sm, but
> information about how much shared memory needs to be allocated during
> MPI_Init() is in btl_sm. Since OMPI doesn't easily allow components to call
> one another, we're stuck.
>
> Supporting Data
>
> Memory Consumption
>
> Memory consumption can be measured or modeled. (I have a byte-accurate
> model.)
>
> Here are some comparisons for the case of:
>
> 1024 on-node processes
> 8-byte pointers (LP64 execution model)
> 4K eager limit
> 32K chunk size
> 128-byte cacheline size
> 4K (Linux) or 8K (Solaris) page size
>
> Here are breakdowns of the shared-memory allocations during MPI_Init() in
> units of 10⁶ bytes:
>
>                       pagesize alignment   cacheline
>                       ------------------   alignment
>   description        8K pages   4K pages
>   ===============
>   CB wrappers           8,682      4,391        235
>   CB queues+tails       9,822      5,531      1,374
>   CB heads              8,632      4,341        184
>   eager freelists       9,171      9,032      8,898
>   other                   370        362        355
>   ---------------
>   total                36,677     23,658     11,046
>
> That is, with pagesize alignment, the CB allocations consume over 3n² pages
> and dominate, even though most of that space is unused.
>
> The next biggest contributor is the eager freelists. There are 2n² eager
> fragments, each 4K (the eager limit), costing (approximately) 8 Gbytes.
>
> With cacheline alignment:
>
> Overall consumption drops by over 2× compared to 4K pages and over 3×
> compared to 8K pages.
> The remaining leading category (eager freelists) can now be scaled down by
> adjusting an existing MCA parameter btl_sm_eager_limit.
> For that matter, the second leading category (CB queues) can also be scaled
> down by adjusting an existing MCA parameter btl_sm_size_of_cb_queue.
>
> Here are results when we not only drop from page-size to cacheline
> alignment, but we also aggregate CB allocations:
>
>   10⁶ bytes          description
>   =========          ===============
>       1,250          FIFOs and CBs
>       8,899          eager freelists
>         270          max freelists
>      ------          ---------------
>      10,418          total
>
> With no remaining pagesize dependence and only modest cacheline dependence,
> one could really start to shoehorn big jobs into a small shared-memory area.
> E.g., consider bumping the eager limit down to 256 bytes, the size of a CB
> queue to 16 entries, and the chunk size to 8K. Then, shared-memory
> consumption for 1024 processes looks like this:
>
>   10⁶ bytes          description
>   =========          ===============
>         311          FIFOs and CBs
>         544          eager freelists
>          68          max freelists
>      ------          ---------------
>         924          total
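>
> For reference, and assuming the usual sm BTL parameter names (with
> btl_sm_max_frag_size being my guess for the "chunk size" knob), such a
> configuration would be requested along these lines:
>
>      mpirun --mca btl_sm_eager_limit 256 \
>             --mca btl_sm_size_of_cb_queue 16 \
>             --mca btl_sm_max_frag_size 8192 ...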
>
> Ping-Pong Latency
>
> We can also look at performance. Here are OSU latency results for short
> messages on a Sun v20z. The absolute numbers are less important than the
> relative difference between the two sets:
>
>      bytes  before     after
>
>        0      0.85     0.84  µsec
>        1      0.97     0.99
>        2      0.97     0.98
>        4      0.97     0.98
>        8      0.97     0.99
>
> There is a penalty for non-null messages due to OMPI "data convertors".
> Importantly, to within the reproducibility of the measurements, it is
> unclear if there is any slowdown that one can attribute to the changes.
> (Results are the median of 5 measurements. The values look smooth, but the
> error bars, which are difficult to characterize, are probably greater than
> the 0.01-0.02 µsec differences seen here.)
>
> Other Considerations
>
> Simply going from pagesize alignment to cacheline alignment should be a
> relatively unintrusive code change and effect most of the reduction in
> shared-memory allocation.
>
> Aggregating allocations as well is more intrusive, but has a few more
> advantages, including:
>
> Slight further reduction in the amount of shared memory allocated.
> Less lock contention, as the number of allocations is reduced from O(n²) to
> O(n), possibly improving start-up performance.
> Can provide memory locality even on systems that don't have maffinity
> support.
>
> It would be nice to size the mmap file automatically and more accurately than is
> done today, but (as noted) I haven't yet figured out how to make the btl_sm
> and mpool_sm components talk to each other.
>
> My proposed code changes need more testing, especially in the case of
> multiple memory nodes per node.
>
> It remains unclear to me if error codes are being treated properly in the
> mca_btl_sm_add_procs() code. E.g., if one process is unable to allocate
> memory in the shared area, should all processes fail?
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
