Great analysis and suggested changes! I've not had a chance yet to look at your hg branch, so this isn't a code review... Barring a bad code review, I'd say these changes should all go into the trunk for inclusion in 1.4.

2009/1/14 Eugene Loh <eugene....@sun.com>:
>
> RFC: Fragmented sm Allocations
>
> WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
> buffers (CB) during MPI_Init().
>
> Also:
>
>   - Improve handling of error codes.
>   - Automate the sizing of the mmap file.
>
> WHY: To reduce consumption of shared memory, making job startup more
> robust and possibly improving the scalability of startup.
>
> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
> ompi_fifo_init(). This is where CBs are initialized one at a time, with
> the components of each CB allocated individually. Changes can be seen in
> ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>
> WHEN: Upon acceptance.
>
> TIMEOUT: January 30, 2009.
>
> ________________________________
>
> WHY (details)
>
> The sm BTL establishes a FIFO for each non-self, on-node connection. Each
> FIFO is initialized during MPI_Init() with a circular buffer (CB). (More
> CBs can be added later in program execution if a FIFO runs out of room.)
>
> A CB has different components that are used in different ways:
>
>   - The "wrapper" is read by both sender and receiver, but is rarely written.
>   - The "queue" (FIFO entries) is accessed by both the sender and the receiver.
>   - The "head" is accessed by the sender.
>   - The "tail" is accessed by the receiver.
>
> For performance reasons, a CB is not allocated as one large data
> structure. Rather, these components are laid out separately in memory and
> the wrapper has pointers to the various locations. Performance
> considerations include:
>
>   - false sharing: a component used by one process should not share a
>     cacheline with a component that is modified by another process
>   - NUMA: certain components should perhaps be mapped preferentially to
>     memory pages that are close to the processes that access them
>
> Currently, the sm BTL handles these issues by allocating each component
> of each CB its own page. (Actually, it couples tails and queues together.)
>
> As the number of on-node processes grows, however, the shared-memory
> allocation skyrockets. Say there are n processes on-node. There are then
> n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head, and
> tail/queue), so the shared-memory allocation for CBs becomes roughly
> 3n^2 pages. For large n, this dominates the shared-memory consumption,
> even though most of the CB allocation is unused. E.g., a 12-byte "head"
> ends up consuming a full memory page! With n = 1024 and 4-Kbyte pages,
> that alone is over 12 Gbytes.
>
> Not only is the 3n^2-page allocation large, but it is also not tunable
> via any MCA parameters.
>
> Large shared-memory consumption has led to some number of start-up and
> other user problems, e.g., the e-mail thread at
> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
>
> WHAT (details)
>
> Several actions are recommended here.
>
> 1. Cacheline Rather than Pagesize Alignment
>
> The first set of changes reduces the alignment of CB components from
> pagesize to cacheline. Though mapping to pages is motivated by NUMA
> locality, note:
>
>   - The code already has NUMA locality optimizations (maffinity and
>     mpools) anyhow.
>   - There is no data that I'm aware of substantiating the benefits of
>     locality optimizations in this context.
>
> More to the point, I've tried some such experiments myself. I had two
> processes communicating via shared memory on a large SMP that had a large
> difference between remote and local memory access times. I timed the
> roundtrip latency for pingpongs between the processes.
> That time was correlated to the relative separation between the two
> processes, and not at all to the placement of the physical memory backing
> the shared variables. It did not matter whether the memory was local to
> the sender or the receiver or remote from both! (E.g., co-located
> processes showed fast timings even if the shared memory was remote to
> both processes.)
>
> My results do not prove that all NUMA platforms behave in the same way.
> My point is only that, though I understand the logic behind locality
> optimizations for FIFO placement, the only data I am aware of does not
> substantiate that logic.
>
> The changes are:
>
> File: ompi/mca/mpool/sm/mpool_sm_module.c
> Function: mca_mpool_sm_alloc()
>
>   Use the alignment requested by the caller rather than adding additional
>   pagesize alignment as well.
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
>
>   Align the ompi_cb_fifo_wrapper_t structure on a cacheline rather than a
>   page.
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
>   Align the two calls to mpool_alloc on a cacheline rather than a page.
>
> 2. Aggregated Allocation
>
> Another option is to lay out all the CBs at once and aggregate their
> allocations.
>
> This may have the added benefit of reducing lock contention during
> MPI_Init(). On the one hand, the 3n^2 CB allocations during MPI_Init()
> contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the
> other hand, I know so far of no data showing that this lock contention is
> impairing start-up scalability.
>
> The objectives here would be to consolidate many CB components together,
> subject to:
>
>   - queues should be local to receivers
>   - heads should be local to senders
>   - tails should be local to receivers
>   - wrappers should not share cachelines with heads or tails
>
> In sum, for process myrank, the FIFO allocation in shared memory during
> MPI_Init() looks something like this:
>
>   ompi_fifo_t from 0 to myrank
>   ompi_fifo_t from 1 to myrank
>   ompi_fifo_t from 2 to myrank
>   ...
>   ompi_fifo_t from n-1 to myrank
>   --- cacheline boundary ---
>   queue of pointers, for CB from 0 to myrank
>   queue of pointers, for CB from 1 to myrank
>   queue of pointers, for CB from 2 to myrank
>   ...
>   queue of pointers, for CB from n-1 to myrank
>   --- cacheline boundary ---
>   head for CB from myrank to 0
>   tail for CB from 0 to myrank
>   head for CB from myrank to 1
>   tail for CB from 1 to myrank
>   head for CB from myrank to 2
>   tail for CB from 2 to myrank
>   ...
>   head for CB from myrank to n-1
>   tail for CB from n-1 to myrank
>   --- cacheline boundary ---
>   wrapper, CB from 0 to myrank
>   wrapper, CB from 1 to myrank
>   wrapper, CB from 2 to myrank
>   ...
>   wrapper, CB from n-1 to myrank
>
> Note that out-bound heads and in-bound tails are interwoven. There should
> be no false sharing, however, even if multiple heads and tails share the
> same cacheline, since they are all accessed by the same process.
>
> For multi-threaded processes, however, it is conceivable that different
> threads will be passing many messages to different peers. So, for
> multi-threaded jobs, we spread heads and tails out onto their own
> cachelines. Consuming the extra shared memory in the multi-threaded case
> is probably tolerable since the on-node process count will be lower.
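>
> To make the space accounting concrete, here is a rough, self-contained
> sketch of how the size of this per-process region might be computed. The
> component sizes and names are placeholders for illustration only; they
> are not the actual OMPI structures, nor the functions proposed below.
>
>   #include <stdio.h>
>   #include <stddef.h>
>
>   #define CACHELINE 128   /* example cacheline size, in bytes */
>
>   /* round x up to a multiple of the (power-of-two) alignment a */
>   static size_t align_up(size_t x, size_t a) { return (x + a - 1) & ~(a - 1); }
>
>   /* Size of the aggregated region owned by one process ("myrank"):
>    * n in-bound FIFO descriptors, n in-bound queues, n head/tail pairs,
>    * and n in-bound wrappers, each section padded to a cacheline boundary. */
>   static size_t aggregated_cb_space(size_t n, size_t fifo_sz, size_t queue_sz,
>                                     size_t headtail_sz, size_t wrapper_sz)
>   {
>       size_t space = 0;
>       space += align_up(n * fifo_sz, CACHELINE);      /* ompi_fifo_t array      */
>       space += align_up(n * queue_sz, CACHELINE);     /* queues of pointers     */
>       space += align_up(n * headtail_sz, CACHELINE);  /* interwoven heads/tails */
>       space += align_up(n * wrapper_sz, CACHELINE);   /* CB wrappers            */
>       return space;
>   }
>
>   int main(void)
>   {
>       /* placeholder sizes, chosen only to exercise the arithmetic:
>        * 64-byte descriptors/wrappers, 128-entry queues of 8-byte pointers,
>        * 16 bytes for a head plus a tail */
>       printf("per-process CB region: %zu bytes\n",
>              aggregated_cb_space(1024, 64, 128 * 8, 16, 64));
>       return 0;
>   }
>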
> The changes are:
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
>   Instead of having ompi_cb_fifo_init() allocate each component of a CB
>   separately, change the interface to allow the caller to pass in an
>   array of addresses. These addresses are assumed to be already allocated
>   and are used for the CB components.
>
>   If the "array of addresses" is NULL, then we allocate the CB components
>   as before.
>
>   Here is pseudocode to illustrate this change. We replace
>
>     fifo->head = mpool_alloc(...);      /* allocate head */
>     if ( NULL == fifo->head )
>         return OMPI_ERR_OUT_OF_RESOURCE;
>
>   with
>
>     if ( NULL != cb_addr ) {
>         fifo->head = cb_addr[1];        /* use supplied address, if available */
>     } else {
>         fifo->head = mpool_alloc(...);  /* allocate head */
>         if ( NULL == fifo->head )
>             return OMPI_ERR_OUT_OF_RESOURCE;
>     }
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init()
>
>   Change the interface to ompi_fifo_init() to allow the caller to supply
>   addresses to use for CB components rather than having callees allocate
>   memory for them.
>
>   Use such a supplied address, if available, when allocating the FIFO
>   (not CB) head.
>
> Function: ompi_fifo_write_to_head()
>
>   Change the call to ompi_cb_fifo_init() to reflect the new interface,
>   passing NULL as the new argument.
>
> File: ompi/mca/btl/sm/btl_sm.c
> Function: compute_initial_cb_fifo_space() and compute_initial_cb_fifo_addrs()
>
>   Add these two functions to compute how much space will be needed for
>   the aggregated CB allocation and to compute addresses for the
>   individual CB components of a particular sender/receiver pair.
>
> Function: sm_btl_first_time_init()
>
>   Increase the allocation of FIFOs (the call to mpool_calloc) to include
>   room for the CBs.
>
> Function: mca_btl_sm_add_procs()
>
>   Compute the addresses where CB components should go and pass them into
>   ompi_fifo_init().
>
> These changes impact only the allocation of CBs during MPI_Init(). If
> FIFOs are grown later during program execution, their components will
> continue to be allocated in a fragmented manner.
>
> 3. Free List Return Codes
>
> This is unrelated to FIFOs, but it is related to more robust handling of
> shared-memory allocation.
>
> The function sm_btl_first_time_init() should test the return values when
> it allocates free lists. It currently does not, proceeding without a
> hiccup even if those allocations indicate an error. The proposed change
> is:
>
> File: ompi/mca/btl/sm/btl_sm.c
> Function: sm_btl_first_time_init()
>
>   Test the return codes from the calls to:
>
>     ompi_free_list_init_new()
>     ompi_free_list_init_new()
>     opal_free_list_init()
>
>   returning a non-successful error status in case of error.
>
> 4. Better Automatic Sizing of mmap File
>
> Currently, the size of the file to be mmapped is governed by three MCA
> parameters:
>
>   mpool_sm_max_size
>   mpool_sm_min_size      (default 128 Mbytes)
>   mpool_sm_per_peer_size (default 32 Mbytes)
>
> Specifically, the file size is
>
>   min(mpool_sm_max_size,
>       max(mpool_sm_min_size,
>           n * mpool_sm_per_peer_size))
>
> This file size is a poor approximation of the amount of shared memory an
> application actually needs during MPI_Init(). E.g., at n=2 the file is
> 128 Mbytes even though less than 1 Mbyte is needed, while at large n the
> file is too small.
>
> Instead, we should add code that produces a better estimate of how much
> shared memory will be needed during MPI_Init().
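>
> For illustration, here is a rough, self-contained sketch contrasting the
> current sizing rule with the kind of estimate I have in mind. The
> estimate function and its inputs are hypothetical placeholders, not
> actual OMPI code or parameter names:
>
>   #include <stdio.h>
>   #include <stddef.h>
>
>   /* Current rule: the mmap file size depends only on three MCA
>    * parameters and the on-node process count n. */
>   static size_t current_mmap_size(size_t n, size_t min_size, size_t max_size,
>                                   size_t per_peer_size)
>   {
>       size_t size = n * per_peer_size;
>       if (size < min_size) size = min_size;
>       if (size > max_size) size = max_size;
>       return size;   /* min(max_size, max(min_size, n * per_peer_size)) */
>   }
>
>   /* Hypothetical estimate: sum what MPI_Init() actually allocates
>    * (the per-process CB regions and the eager free lists) plus some
>    * slack, instead of guessing from a per-peer constant. */
>   static size_t estimated_mmap_size(size_t n, size_t cb_space_per_proc,
>                                     size_t eager_frags_per_pair,
>                                     size_t eager_limit, size_t slack)
>   {
>       size_t cb_total    = n * cb_space_per_proc;
>       size_t eager_total = n * (n - 1) * eager_frags_per_pair * eager_limit;
>       return cb_total + eager_total + slack;
>   }
>
>   int main(void)
>   {
>       size_t n = 1024;
>       /* current defaults: min 128 Mbytes, per-peer 32 Mbytes;
>        * the max used here is just a placeholder cap */
>       printf("current rule: %zu bytes\n",
>              current_mmap_size(n, 128u << 20, (size_t)2 << 30, 32u << 20));
>       /* placeholder inputs: 1 Mbyte of CB space per process,
>        * 2 eager fragments per process pair at a 4-Kbyte eager limit */
>       printf("estimate:     %zu bytes\n",
>              estimated_mmap_size(n, 1u << 20, 2, 4096, 64u << 20));
>       return 0;
>   }
>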
> Regarding the MCA parameters:
>
>   mpool_sm_max_size:      default should be 0 (no limit)
>   mpool_sm_min_size:      default should be 0 (no limit)
>   mpool_sm_per_peer_size: should be eliminated
>
> More accurate sizing could help reduce the problems users see when
> starting sm jobs with large on-node process counts.
>
> One problem is that the size of the shared file is set by mpool_sm, but
> the information about how much shared memory needs to be allocated during
> MPI_Init() is in btl_sm. Since OMPI doesn't easily allow components to
> call one another, we're stuck.
>
> Supporting Data
>
> Memory Consumption
>
> Memory consumption can be measured or modeled. (I have a byte-accurate
> model.)
>
> Here are some comparisons for the case of:
>
>   1024 on-node processes
>   8-byte pointers (LP64 execution model)
>   4K eager limit
>   32K chunk size
>   128-byte cacheline size
>   4K (Linux) or 8K (Solaris) page size
>
> Here are breakdowns of the shared-memory allocations during MPI_Init(),
> in units of 10^6 bytes:
>
>                       pagesize alignment     cacheline
>   description         8K pages   4K pages    alignment
>   ===============     ========   ========    =========
>   CB wrappers            8,682      4,391          235
>   CB queues+tails        9,822      5,531        1,374
>   CB heads               8,632      4,341          184
>   eager freelists        9,171      9,032        8,898
>   other                    370        362          355
>   ---------------     --------   --------    ---------
>   total                 36,677     23,658       11,046
>
> That is, with pagesize alignment, the CB allocations consume over 3n^2
> pages and dominate, even though most of that space is unused. (A rough
> leading-term check of these page-aligned numbers appears in the sketch at
> the end of this section.)
>
> The next biggest contributor is the eager freelists. There are 2n^2 eager
> fragments, each 4K (the eager limit), costing approximately 8 Gbytes.
>
> With cacheline alignment:
>
>   - Overall consumption drops by over 2× compared to 4K pages and over 3×
>     compared to 8K pages.
>   - The remaining leading category (eager freelists) can now be scaled
>     down by adjusting an existing MCA parameter, btl_sm_eager_limit.
>   - For that matter, the second leading category (CB queues) can also be
>     scaled down by adjusting an existing MCA parameter,
>     btl_sm_size_of_cb_queue.
>
> Here are results when we not only drop from pagesize to cacheline
> alignment but also aggregate the CB allocations:
>
>   10^6 bytes    description
>   ==========    ===============
>        1,250    FIFOs and CBs
>        8,899    eager freelists
>          270    max freelists
>   ----------    ---------------
>       10,418    total
>
> With no more pagesize dependence and little more cacheline dependence,
> one could really start to shoehorn big jobs into a small shared-memory
> area. E.g., consider bumping the eager limit down to 256 bytes, the size
> of a CB queue down to 16 entries, and the chunk size down to 8K. Then
> shared-memory consumption for 1024 processes looks like this:
>
>   10^6 bytes    description
>   ==========    ===============
>          311    FIFOs and CBs
>          544    eager freelists
>           68    max freelists
>   ----------    ---------------
>          924    total
>
> Ping-Pong Latency
>
> We can also look at performance. Here are OSU latency results for short
> messages on a Sun v20z. The absolute numbers are less important than the
> relative difference between the two sets:
>
>   bytes    before    after
>       0      0.85     0.84    µsec
>       1      0.97     0.99
>       2      0.97     0.98
>       4      0.97     0.98
>       8      0.97     0.99
>
> There is a penalty for non-null messages due to OMPI "data convertors".
> Importantly, to within the reproducibility of the measurements, it is
> unclear whether there is any slowdown that one can attribute to the
> changes. (Results are the median of 5 measurements. The values look
> smooth, but the error bars, which are difficult to characterize, are
> probably greater than the 0.01-0.02 µsec differences seen here.)
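>
> As a sanity check on the page-aligned numbers above, here is a rough,
> self-contained sketch that reproduces only their leading terms: each of
> the n(n-1) CBs pays a page for its wrapper, a page for its head, and at
> least a page for its tail+queue. It is deliberately crude (the table
> itself comes from the byte-accurate model), and the queue size used here
> is a placeholder.
>
>   #include <stdio.h>
>   #include <stddef.h>
>
>   /* round x up to a multiple of the (power-of-two) alignment a */
>   static size_t align_up(size_t x, size_t a) { return (x + a - 1) & ~(a - 1); }
>
>   /* leading-term CB consumption with page-aligned components */
>   static double cb_pages_gbytes(size_t n, size_t page, size_t queue_bytes)
>   {
>       size_t pairs    = n * (n - 1);
>       size_t wrappers = pairs * page;                         /* one page each */
>       size_t heads    = pairs * page;                         /* one page each */
>       size_t queues   = pairs * align_up(queue_bytes, page);  /* tail + queue  */
>       return (wrappers + heads + queues) / 1e9;
>   }
>
>   int main(void)
>   {
>       size_t n = 1024;
>       size_t queue_bytes = 128 * 8;   /* placeholder: 128 pointers of 8 bytes */
>       printf("8K pages: about %.0f Gbytes of CB space\n",
>              cb_pages_gbytes(n, 8192, queue_bytes));
>       printf("4K pages: about %.0f Gbytes of CB space\n",
>              cb_pages_gbytes(n, 4096, queue_bytes));
>       return 0;
>   }
>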
> Other Considerations
>
> Simply going from pagesize to cacheline alignment should be a relatively
> unintrusive code change and should effect most of the reduction in
> shared-memory allocation.
>
> Also aggregating the allocations is more intrusive, but it has a few more
> advantages, including:
>
>   - A slight further reduction in the amount of shared memory allocated.
>   - Less lock contention, since the number of allocations is reduced from
>     O(n^2) to O(n), possibly improving start-up performance.
>   - Memory locality even on systems that don't have maffinity support.
>
> It would be nice to size the mmap file automatically better than is done
> today, but (as noted) I haven't yet figured out how to make the btl_sm
> and mpool_sm components talk to each other.
>
> My proposed code changes need more testing, especially in the case of
> multiple memory nodes per node.
>
> It remains unclear to me whether error codes are being treated properly
> in the mca_btl_sm_add_procs() code. E.g., if one process is unable to
> allocate memory in the shared area, should all processes fail?

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/