Sorry for the delay in replies -- it's summer / vacation season, and I think we (as a community) are a little behind in answering some of these emails. :-(
It's hard to say for any given machine, but a bunch of different hardware factors can come into play, such as:

- L1, L2, L3 cache sizes
- Cache contention
- Memory controller connectivity and locality

I.e., exactly which hardware resources are the memcpy()'s in question using, and how do they interact with each other? How much overhead is produced, and/or how much contention ensues when multiple requests are in flight simultaneously?

For example, it may be counter-intuitive, but sometimes injecting a small amount of delay into a software pipeline can keep hardware resources from becoming overwhelmed, so the overall execution becomes more efficient and consumes less wall-clock time. Hence, doing 2 x 1MB memcpy()'s (to effect a 2MB MPI_Send) may actually be more efficient overall, even though the individual parts of the transaction are less efficient. This is a complete guess, and may have nothing to do with your system, but it's one of many possibilities.

Another possible factor: the specific memcpy() implementation is highly relevant. It's been a few years since I've paid close attention to memcpy(), but at one time there was significant variation in the quality of memcpy() implementations between different compilers and/or versions of libc. I don't know if this is still a factor, or whether memcpy() is pretty well optimized in most situations these days. Additionally, alignment can be an issue (although for message sizes of 2MB, I'm guessing your buffer is page-aligned, so this probably isn't an issue).

All that being said, I'm not intimately familiar with the internals of XPMEM, so I don't know what userspace/kernel-space mechanisms come into play for mapping the shared memory (e.g., is it lazily mapping the shared memory?).

Also, you're probably doing this already, but these kinds of things are worth mentioning: make sure your performance benchmarks are testing the right things. Do warmup transfers, make sure you're not swapping, make sure all the processes and memory are pinned properly, make sure you're on an otherwise-quiet machine, etc. All the Usual Benchmarking Things. (I've appended a rough, standalone sketch of this kind of chunked-vs-single memcpy comparison below the quoted message.)

--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: devel <devel-boun...@lists.open-mpi.org> on behalf of Giorgos Katevainis via devel <devel@lists.open-mpi.org>
Sent: Thursday, July 28, 2022 9:33 AM
To: Open MPI Developers
Cc: Giorgos Katevainis
Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which effectively causes memory copies to take place in chunks (used in mca_smsc_xpmem_memmove()). The comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller or larger number may provide better performance (default: 256k)"

And I have indeed observed a performance difference by adjusting it! E.g. in a simple point-to-point test, 2 MB messages do significantly better with the parameter set to 1 MB vs 2 MB. But... why? I suppose I could imagine a memcpy of a larger size being more efficient, but what would cause many small ones to end up being quicker than a single large one? Might it have something to do with memcpy intrinsics and different implementations for different sizes?

If someone knows what's going on under the hood and/or could direct me to any relevant resources, I would greatly appreciate it!

George
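Here's the rough sketch I mentioned above. To be clear: this is NOT the actual mca_smsc_xpmem_memmove() code and it doesn't go through XPMEM at all (so it won't show any attach/mapping effects); it just times the same 2 MB copy done as one big memcpy() vs. split into fixed-size chunks, with a warmup pass first. The chunk sizes and iteration count are arbitrary picks for illustration; run it pinned (e.g., under taskset or numactl) on an otherwise-quiet machine.

/*
 * chunked_copy.c: illustration only -- NOT the actual Open MPI
 * smsc/xpmem code.  Times one large memcpy() vs. the same copy split
 * into fixed-size chunks, with a warmup pass before each measurement.
 *
 * Build: gcc -O2 -o chunked_copy chunked_copy.c   (add -lrt on older glibc)
 * Run pinned, e.g.: taskset -c 2 ./chunked_copy
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy 'size' bytes from src to dst in chunks of at most 'chunk' bytes. */
static void chunked_copy(void *dst, const void *src, size_t size, size_t chunk)
{
    char *d = dst;
    const char *s = src;
    while (size > 0) {
        size_t n = (size < chunk) ? size : chunk;
        memcpy(d, s, n);
        d += n;
        s += n;
        size -= n;
    }
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const size_t size = 2UL * 1024 * 1024;  /* the 2 MB "message" */
    const int iters = 1000;
    /* Chunk sizes to try; the last entry (== size) is "one big memcpy". */
    const size_t chunks[] = { 256 * 1024, 1024 * 1024, 2UL * 1024 * 1024 };

    char *src = malloc(size);
    char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        return 1;
    }
    memset(src, 0xa5, size);   /* touch the source pages */

    for (size_t c = 0; c < sizeof(chunks) / sizeof(chunks[0]); ++c) {
        /* Warmup: fault in the destination pages and warm the caches. */
        for (int i = 0; i < 10; ++i) {
            chunked_copy(dst, src, size, chunks[c]);
        }

        double t0 = now_sec();
        for (int i = 0; i < iters; ++i) {
            chunked_copy(dst, src, size, chunks[c]);
        }
        double t1 = now_sec();

        printf("chunk = %8zu bytes: %6.2f GB/s\n",
               chunks[c], (double) size * iters / (t1 - t0) / 1e9);
    }

    free(src);
    free(dst);
    return 0;
}

If chunking alone is what matters on your machine, you should see a difference between the 256 KB / 1 MB / 2 MB rows even in this simplified, single-process setting; if the numbers come out flat, then the effect you're seeing more likely involves the XPMEM attach/mapping path or the interaction between the two processes, which this sketch deliberately leaves out.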