Title: RFC: Fragmented sm Allocations
WHY: To reduce consumption of shared memory, making job startup more robust and possibly improving the scalability of startup.

WHERE: In mca_btl_sm_add_procs(), in the loop over calls to ompi_fifo_init(), where CBs are initialized one at a time and each CB's components are allocated individually. Changes can be seen in ssh://www.open-mpi.org/~eugene/hg/sm-allocation.

WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.

WHY (details)

The sm BTL establishes a FIFO for each non-self, on-node connection. Each FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs can be added later in program execution if a FIFO runs out of room.) A CB has different components that are used in different ways:
For performance reasons, a CB is not allocated as one large data structure. Rather, these components are laid out separately in memory and the wrapper has pointers to the various locations. Performance considerations include:
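To make this concrete, here is a much-simplified sketch of that organization; the type and field names below are illustrative stand-ins, not the actual ompi_fifo_t / ompi_cb_fifo_t definitions:

    /* Much-simplified stand-ins, not the actual OMPI definitions.  The
     * point is that the wrapper embeds no components; it only points to
     * them, so each component can be placed (and aligned) independently
     * in shared memory. */
    typedef struct {
        volatile int    head;      /* queue index advanced by the sender   */
    } cb_head_t;

    typedef struct {
        volatile int    tail;      /* queue index advanced by the receiver */
    } cb_tail_t;

    typedef struct {
        cb_head_t      *head;      /* written mostly by the sending side   */
        cb_tail_t      *tail;      /* written mostly by the receiving side */
        volatile void **queue;     /* circular queue of fragment pointers  */
        int             size;      /* number of entries in the queue       */
    } cb_wrapper_t;

Since one side of the connection updates the head while the other updates the tail, keeping the components apart (and suitably aligned) presumably helps avoid false sharing, in addition to the NUMA placement concerns noted below.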
Currently, the sm BTL handles these issues by giving each component of each CB its own page. (Actually, it couples tails and queues together.) As the number of on-node processes grows, however, the shared-memory allocation skyrockets. Say there are n processes on-node. There are then n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head, and tail/queue), so the shared-memory allocation for CBs becomes 3n^2 pages. (With n = 1024 and 8-Kbyte pages, for example, that is over 3 million pages, or roughly 25 Gbytes, just for CBs.) For large n, this dominates the shared-memory consumption, even though most of the CB allocation is unused; e.g., a 12-byte "head" ends up consuming a full memory page! Not only is the 3n^2-page allocation large, it is also not tunable via any MCA parameters. Large shared-memory consumption has led to a number of start-up and other user problems, e.g., the e-mail thread at http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)

Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment

The first set of changes reduces the alignment of CB components from pagesize to cacheline. Though mapping to pages is motivated by NUMA locality, note:
The changes are:
2. Aggregated Allocation

Another option is to lay out all the CBs at once and aggregate their allocations. This may have the added benefit of reducing lock contention during MPI_Init(): on the one hand, the 3n^2 CB allocations during MPI_Init() contend for a single mca_common_sm_mmap->map_seg->seg_lock lock; on the other hand, so far I know of no data showing that this lock contention impairs start-up scalability. The objectives here would be to consolidate many CB components together, subject to:
In sum, for process myrank, the FIFO allocation in shared memory during MPI_Init() looks something like this:

    ompi_fifo_t from 0 to myrank
    ompi_fifo_t from 1 to myrank
    ompi_fifo_t from 2 to myrank
    ...
    ompi_fifo_t from n-1 to myrank
    --- cacheline boundary ---
    queue of pointers, for CB from 0 to myrank
    queue of pointers, for CB from 1 to myrank
    queue of pointers, for CB from 2 to myrank
    ...
    queue of pointers, for CB from n-1 to myrank
    --- cacheline boundary ---
    head for CB from myrank to 0      tail for CB from 0 to myrank
    head for CB from myrank to 1      tail for CB from 1 to myrank
    head for CB from myrank to 2      tail for CB from 2 to myrank
    ...
    head for CB from myrank to n-1    tail for CB from n-1 to myrank
    --- cacheline boundary ---
    wrapper, CB from 0 to myrank
    wrapper, CB from 1 to myrank
    wrapper, CB from 2 to myrank
    ...
    wrapper, CB from n-1 to myrank
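A minimal sketch of how such an aggregated, cacheline-aligned region might be laid out, assuming a 64-byte cacheline and per-component sizes supplied by the caller (the helper names here are hypothetical, not taken from the changeset):

    /* Illustrative sketch only, not the actual changeset: carve one
     * aggregated shared-memory allocation for process myrank into the
     * regions shown above, rounding up to a cacheline boundary between
     * regions. */
    #include <stddef.h>

    #define CACHELINE 64   /* assumed cacheline size */

    static size_t align_up(size_t x, size_t a)
    {
        return (x + a - 1) & ~(a - 1);
    }

    /* Compute the region offsets and total size of the single allocation
     * that replaces the per-CB page-aligned allocations. */
    static size_t layout_fifos(size_t n,
                               size_t fifo_bytes, size_t queue_bytes,
                               size_t headtail_bytes, size_t wrapper_bytes,
                               size_t *off_fifos, size_t *off_queues,
                               size_t *off_headtails, size_t *off_wrappers)
    {
        size_t off = 0;

        *off_fifos     = off;  off = align_up(off + n * fifo_bytes,     CACHELINE);
        *off_queues    = off;  off = align_up(off + n * queue_bytes,    CACHELINE);
        *off_headtails = off;  off = align_up(off + n * headtail_bytes, CACHELINE);
        *off_wrappers  = off;  off = align_up(off + n * wrapper_bytes,  CACHELINE);

        return off;
    }

Each process would then make one (or a few) calls into the shared-memory allocator instead of three per CB, which is also where any reduction in seg_lock contention would come from.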
The changes are:
These changes impact only the allocation of CBs during MPI_Init(). If FIFOs are grown later during program execution, their components will continue to be allocated in a fragmented manner.

3. Free List Return Codes

This is unrelated to FIFOs, but it is related to more robust handling of shared-memory allocation. The function sm_btl_first_time_init() should test the return values when it allocates free lists. It currently does not, proceeding without a hiccup even if those allocations indicate an error. The proposed change is:
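The diff itself is in the repository cited above. Purely as an illustration of the kind of checking intended, with a hypothetical init_free_list() standing in for the free-list setup calls in sm_btl_first_time_init() and with locally defined stand-ins for OMPI's return codes:

    /* Illustrative pattern only, not the actual diff: test each free-list
     * allocation's return code and propagate failure instead of silently
     * continuing. */
    #define OMPI_SUCCESS  0      /* local stand-ins for the real constants */
    #define OMPI_ERROR   (-1)

    static int init_free_list(const char *which)
    {
        (void) which;
        return OMPI_ERROR;       /* stand-in: pretend the allocation failed */
    }

    static int sm_first_time_init_sketch(void)
    {
        int rc;

        if (OMPI_SUCCESS != (rc = init_free_list("eager fragments"))) {
            return rc;           /* report the error up ...                 */
        }
        if (OMPI_SUCCESS != (rc = init_free_list("max fragments"))) {
            return rc;           /* ... instead of proceeding regardless    */
        }
        return OMPI_SUCCESS;
    }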
4. Better Automatic Sizing of mmap File

Currently, the size of the file to be mmapped is governed by three MCA parameters:
Specifically, the file size is

    min(mpool_sm_max_size, max(mpool_sm_min_size, n * mpool_sm_per_peer_size))

This file size is a poor approximation of the actual amount of shared memory needed by an application during MPI_Init(). E.g., at n=2, the file is 128M even though less than 1M is needed. At large n, however, the file can end up being too small. Instead, we should add code that produces a better estimate of how much shared memory will be needed during MPI_Init(). Regarding the MCA parameters:
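Returning to the sizing formula above, in code form the current rule amounts to the following; the parameter values in the usage comment are assumed defaults used only to reproduce the n=2 example, not authoritative numbers:

    /* Sketch of the current sizing rule stated above. */
    #include <stddef.h>

    static size_t sm_file_size(size_t n, size_t per_peer_size,
                               size_t min_size, size_t max_size)
    {
        size_t size = n * per_peer_size;
        if (size < min_size) size = min_size;   /* max(min_size, n*per_peer) */
        if (size > max_size) size = max_size;   /* min(max_size, ...)        */
        return size;
    }

    /* E.g., with assumed defaults of min_size = 128M and per_peer_size = 32M,
     * sm_file_size(2, ...) yields 128M, even though well under 1M is actually
     * needed during MPI_Init() at n = 2. */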
More accurate sizing could help reduce the problems users see starting sm jobs with large on-node-process counts. One problem is that the size of the shared file is set by mpool_sm, but information about how much shared memory needs to be allocated during MPI_Init() is in btl_sm. Since OMPI doesn't easily allow components to call one another, we're stuck.

Supporting Data

Memory Consumption

Memory consumption can be measured or modeled. (I have a byte-accurate model.) Here are some comparisons for the case of:
Here are breakdowns of the shared-memory allocations during MPI_Init() in units of 10^6 bytes:

                        pagesize alignment     cacheline
                        ------------------     alignment
    description         8K pages    4K pages
    ===============
    CB wrappers            8,682       4,391         235
    CB queues+tails        9,822       5,531       1,374
    CB heads               8,632       4,341         184
    eager freelists        9,171       9,032       8,898
    other                    370         362         355
    ---------------
    total                 36,677      23,658      11,046

That is, with pagesize alignment, the CB allocations consume over 3n^2 pages and dominate, even though most of that space is unused. The next biggest contributor is the eager freelists. There are 2n^2 eager fragments, each 4K (the eager limit), costing (approximately) 8 Gbytes. With cacheline alignment:
Here are results when we not only drop from pagesize to cacheline alignment but also aggregate the CB allocations:

    10^6 bytes   description
    ==========   ===============
         1,250   FIFOs and CBs
         8,899   eager freelists
           270   max freelists
        ------   ---------------
        10,418   total

With no remaining pagesize dependence and only a mild cacheline dependence, one could really start to shoehorn big jobs into a small shared-memory area. E.g., consider bumping the eager limit down to 256 bytes, the size of a CB queue down to 16 entries, and the chunk size down to 8K. Then, shared-memory consumption for 1024 processes looks like this:

    10^6 bytes   description
    ==========   ===============
           311   FIFOs and CBs
           544   eager freelists
            68   max freelists
        ------   ---------------
           924   total

Ping-Pong Latency

We can also look at performance. Here are OSU latency results for short messages on a Sun v20z. The absolute numbers are less important than the relative difference between the two sets:

    bytes   before   after
        0     0.85    0.84   µsec
        1     0.97    0.99
        2     0.97    0.98
        4     0.97    0.98
        8     0.97    0.99

There is a penalty for non-null messages due to OMPI "data convertors". Importantly, to within the reproducibility of the measurements, it is unclear if there is any slowdown that one can attribute to the changes. (Results are the median of 5 measurements. The values look smooth, but the error bars, which are difficult to characterize, are probably greater than the 0.01-0.02 µsec differences seen here.)

Other Considerations

Simply going from pagesize alignment to cacheline alignment should be a relatively unintrusive code change and should effect most of the reduction in shared-memory allocation. Aggregating allocations as well is more intrusive, but it has a few more advantages, including:
It would be nice to automatically size the mmap file better than is done today, but (as noted) I haven't yet figured out how to make the btl_sm and mpool_sm components talk to each other. My proposed code changes need more testing, especially in the case of multiple memory nodes per node. It also remains unclear to me whether error codes are being treated properly in the mca_btl_sm_add_procs() code. E.g., if one process is unable to allocate memory in the shared area, should all processes fail?