Sorry for the delay in replies -- it's summer / vacation season, and I think we 
(as a community) are a little behind in answering some of these emails.  :-(

It's hard to say for any given machine, but a bunch of different hardware 
factors can come into play, such as:

- L1, L2, L3 cache sizes
- Cache contention
- Memory controller connectivity and locality

I.e., exactly which hardware resources are the memcpy()'s in question using, 
and how do they interact with each other?  How much overhead is produced, 
and/or how much contention ensues when multiple requests are in flight 
simultaneously?  For example, it may be counter-intuitive, but sometimes 
injecting a small amount of delay in a software pipeline keeps hardware 
resources from becoming overwhelmed, so the overall execution becomes more 
efficient and consumes less wall-clock time.  Hence, doing 2 x 1MB memcpy()'s 
(to effect a 2MB MPI_Send) may actually be more efficient overall, even though 
the individual parts of the transaction are less efficient.  This is a complete 
guess, and may have nothing to do with your system, but it's one of many 
possibilities.
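
To make the "2 x 1MB instead of 1 x 2MB" idea concrete, here is a minimal 
sketch of a chunked copy.  It is not the actual smsc/xpmem code: the name 
chunked_memcpy() and its signature are made up for illustration, and the real 
mca_smsc_xpmem_memmove() may well be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (NOT the Open MPI implementation): copy `len` bytes
 * from `src` to `dst`, issuing one memcpy() per piece of at most `chunk`
 * bytes.  The buffers must not overlap, as with plain memcpy(). */
void chunked_memcpy(void *dst, const void *src, size_t len, size_t chunk)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(chunk > 0); /* a zero chunk size would never make progress */

    while (len > 0) {
        size_t n = (len < chunk) ? len : chunk; /* last piece may be short */
        memcpy(d, s, n);
        d += n;
        s += n;
        len -= n;
    }
}
```

With len = 2MB, passing chunk = 1MB gives two memcpy() calls and chunk = 2MB 
gives one, which is roughly the comparison you were making with the MCA 
parameter.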

Another possible factor: the specific memcpy() implementation is highly 
relevant.  It's been a few years since I've paid close attention to memcpy(), 
but at one time, there was significant variation in the quality of memcpy() 
implementations between different compilers and/or versions of libc.  I don't 
know if this is still a factor, or whether memcpy() is pretty well optimized in 
most situations these days.  Additionally, alignment can be an issue (although 
for message sizes of 2MB, I'm guessing your buffer is page-aligned, and this 
probably isn't an issue).

All that being said, I'm not intimately familiar with the internals of XPMEM, 
so I don't know what userspace/kernel space mechanisms will come into play for 
mapping the shared memory (e.g., is it lazily mapping the shared memory?).

Also, you're probably doing this already, but these kinds of things are worth 
mentioning.  Make sure your performance benchmarks are testing the right things: 
do warmup transfers, make sure you're not swapping, make sure all the processes 
and memory are pinned properly, make sure you're on an otherwise-quiet machine, 
etc.  All the Usual Benchmarking Things.
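
As a rough illustration of the warmup point only, a timing loop might look 
something like the sketch below.  The names now_sec() and time_memcpy() are 
hypothetical, it assumes a POSIX system with clock_gettime(), and it does not 
handle pinning or check for swapping; those have to be dealt with outside the 
program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Written after each copy so the compiler is less likely to drop the
 * repeated copies as dead stores. */
static volatile unsigned char sink;

/* Monotonic wall-clock time in seconds. */
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical harness: average seconds per memcpy() of `len` bytes,
 * measured over `iters` copies after `warmup` untimed copies. */
double time_memcpy(void *dst, const void *src, size_t len, int warmup, int iters)
{
    assert(len > 0 && iters > 0);

    /* Warmup: fault in pages and warm caches/TLBs before timing. */
    for (int i = 0; i < warmup; i++) {
        memcpy(dst, src, len);
        sink = ((unsigned char *)dst)[len - 1];
    }

    double start = now_sec();
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        sink = ((unsigned char *)dst)[len - 1];
    }
    return (now_sec() - start) / iters;
}
```

Comparing the result with and without warmup iterations on your machine should 
show how much the first few transfers skew the numbers.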

Jeff Squyres

From: devel on behalf of Giorgos Katevainis via devel
Sent: Thursday, July 28, 2022 9:33 AM
To: Open MPI Developers
Cc: Giorgos Katevainis
Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which
effectively causes memory copies to take place in chunks (used in
mca_smsc_xpmem_memmove()). The comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller
or larger number may provide better performance (default: 256k)"

And I have indeed observed a performance difference by adjusting it! E.g. in a
simple point-to-point test, 2 MB messages do significantly better with the
parameter set to 1 MB vs 2 MB. But... why? I suppose I could imagine a memcpy
of larger size being more efficient, but what would cause many small ones to
end up being quicker than a single large one? Might it have something to do
with memcpy intrinsics and different implementations for different sizes?

If someone knows what's going on under the hood and/or could direct me to any
relevant resources, I would greatly appreciate it!

