Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which effectively causes memory copies to take place in chunks (it is used in mca_smsc_xpmem_memmove()). The comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller or larger number may provide better performance (default: 256k)"

And I have indeed observed a performance difference by adjusting it! E.g. in a simple point-to-point test, 2 MB messages do significantly better with the parameter set to 1 MB than with it set to 2 MB.

But... why? I could imagine a single memcpy of a larger size being more efficient, but what would cause several smaller ones to end up being quicker than one large one? Might it have something to do with memcpy intrinsics and different implementations being selected for different sizes?

If someone knows what's going on under the hood and/or could direct me to any relevant resources, I would greatly appreciate it!

George
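
P.S. In case it helps frame the question, here is a minimal sketch of what I understand the chunking to amount to. This is not the Open MPI source: chunked_copy and its parameter names are made up for illustration, and I'm assuming the chunks are simply copied back to back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the Open MPI code): copy `size` bytes from
 * `src` to `dst`, issuing memcpy calls of at most `chunk_size` bytes. */
void chunked_copy(void *dst, const void *src, size_t size, size_t chunk_size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* A zero chunk size would never make progress. */
    assert(chunk_size > 0);

    while (size > 0) {
        /* The last chunk may be shorter than chunk_size. */
        size_t n = size < chunk_size ? size : chunk_size;

        memcpy(d, s, n);
        d += n;
        s += n;
        size -= n;
    }
}
```

With chunk_size at or above the message size, the loop degenerates into a single memcpy call, which is the case I'm comparing against.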