Sébastien Boisvert wrote:
Now I can describe the cases.
All of the test cases can be explained by the test relying on messages
being sent eagerly (something that test4096.cpp does not require).
Case 1: 30 MPI ranks, message size is 4096 bytes
File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
A 4096-byte message goes rendezvous here. For eager, try 4000 bytes or lower.
Case 2: 30 MPI ranks, message size is 1 byte
File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.
1 byte is eager.
Case 3: 2 MPI ranks, message size is 4096 bytes
File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
Same as Case 1.
Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
disabled
File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.
The eager limit for TCP is 65536 bytes (perhaps less some overhead), so
these messages are eager.
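
To make the size arithmetic in the four cases concrete, here is a minimal sketch (not Open MPI code) of the rule I am applying: a message goes eager if its payload plus some per-message header fits under the transport's eager limit, and rendezvous otherwise. The names `Protocol`, `classify` and `protocol_name` are made up for illustration, and the 64-byte header is only a placeholder assumption; the real overhead depends on the transport and the Open MPI version. The 4096-byte shared-memory limit is the value implied by Case 1, and the 65536-byte TCP limit is the one mentioned for Case 4.

```cpp
#include <cstddef>

// Which protocol a message of a given size would be sent with.
enum class Protocol { Eager, Rendezvous };

// Hypothetical helper, NOT an Open MPI API.
// A message is eager if payload + header fits within the eager limit.
constexpr Protocol classify(std::size_t payload_bytes,
                            std::size_t eager_limit_bytes,
                            std::size_t header_bytes) {
    return (payload_bytes + header_bytes <= eager_limit_bytes)
               ? Protocol::Eager
               : Protocol::Rendezvous;
}

// Human-readable name, handy when printing a table of cases.
constexpr const char* protocol_name(Protocol p) {
    return p == Protocol::Eager ? "eager" : "rendezvous";
}

// Values used in this thread. kAssumedHeader is a placeholder assumption.
constexpr std::size_t kSharedMemEagerLimit = 4096;   // implied by Case 1
constexpr std::size_t kTcpEagerLimit       = 65536;  // mentioned for Case 4
constexpr std::size_t kAssumedHeader       = 64;     // assumed, not measured
```

With these assumed numbers, `classify(4096, kSharedMemEagerLimit, kAssumedHeader)` comes out rendezvous (Cases 1 and 3), while 4000 bytes or 1 byte over shared memory, and 4096 bytes over TCP, all come out eager (Cases 2 and 4). A test that only completes when its sends are eager will therefore hang in exactly the cases that hung.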