The osu_bibw micro-benchmark from Ohio State's OMB 3.1 suite hangs when run over OpenMPI 1.2.5 from OFED 1.3 with the OpenIB BTL if there is insufficient lockable memory: a 128MB lockable-memory limit produces a hang once the test reaches 4MB messages, while 512MB is enough for it to pass. I observed this with both InfiniPath and Mellanox adapter cards, and I see the same behavior with 1.2.6. I know the general advice is to use an unlimited or very large setting (per the FAQ), but there are legitimate reasons for clusters to set finite user limits.
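As an aside, since limits set in an interactive shell do not always propagate to ranks started via a launcher or remote daemon, it can be worth having the processes report the memlock limit they actually inherited. A minimal check (plain getrlimit(), nothing OpenMPI-specific) would be something like:

    #include <stdio.h>
    #include <sys/resource.h>

    /* Print the locked-memory limit this process actually runs with. */
    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
            perror("getrlimit(RLIMIT_MEMLOCK)");
            return 1;
        }
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("RLIMIT_MEMLOCK: unlimited\n");
        else
            printf("RLIMIT_MEMLOCK: %llu bytes\n",
                   (unsigned long long)rl.rlim_cur);
        return 0;
    }

Running that under mpirun confirms whether the 128MB vs. 512MB figures above are really what the ranks see.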
For each message size in the loop, osu_bibw has both ranks post 64 non-blocking sends followed by 64 non-blocking receives, and then wait for them all to complete (64 is the default window size, i.e. the number of concurrent messages; a minimal sketch of the pattern is in the P.S. below). For 4MB messages that is 256MB of memory to be sent, which more than exhausts the 128MB of lockable memory on these systems. The OpenIB BTL does an ib_reg_mr for as many of the sends as it can and queues the rest on a pending list. The ib_reg_mr calls for all of the posted receives then fail as well, due to the ulimit check, so all of them land on a pending list too. That means neither rank ever gets as far as an ib_post_recv, neither side can make progress, and the benchmark hangs without completing a single 4MB message! This contrasts with the uni-directional osu_bw, where one side does the sends and the other does the receives, so progress can be made.

This is admittedly a hard problem to solve in the general case. It is unfortunate that it leads to a hang rather than a message advising the user to check their ulimits; perhaps there should be a warning the first time the ulimit is exceeded, to alert the user to the problem. One solution would be to divide the ulimit into separate limits for sending and receiving, so that excessive sending cannot block all receiving. That would require OpenMPI to track the ulimit usage separately for sends and receives.

In this particular synthetic benchmark there turns out to be a straightforward workaround. The benchmark actually sends from the same buffer 64 times over, and receives into another buffer 64 times over (all posted concurrently). So there are really only two 4MB buffers in play, yet the kernel IB code charges the user separately for all 64 registrations of each, even though the user already has those pages locked. In fact the Linux implementation of mlock (over)charges in the same way, so I guess that choice is intentional and the extra complexity of spotting already-locked pages wasn't considered worthwhile.

This leads to the workaround of using --mca mpi_leave_pinned 1. That turns on the code in the OpenIB BTL which caches the registration descriptors, so there is only one ib_reg_mr for the send buffer and one for the receive buffer, and all the other registrations hit the descriptor cache. This saves the day and the benchmark runs without problem. If this were the default option it might save users much consternation. Note that the workaround doesn't actually need the descriptors to stay pinned after the sends and receives complete; all that is needed is the caching while they are posted. So one could enable the descriptor-caching mechanism by default even when mpi_leave_pinned is off.

Also note that this is still a workaround that happens to be sufficient for the osu_bibw case, not a general panacea. osu_bibw and osu_bw are "broken" anyway, in that it is illegal to post multiple concurrent receives into the same receive buffer; I believe that is done to minimize CPU cache effects and maximize the measured bandwidth. Having multiple posted sends from the same send buffer is perfectly reasonable, though (e.g. a broadcast), so caching those descriptors and reducing lockable-memory usage seems like a good idea to me. And although osu_bibw is very synthetic, it is conceivable that real codes with large messages could hit the same hang (e.g. just MPI_Sendrecv a message larger than ulimit -l?).

Cheers,
Mark
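P.S. For anyone who wants to see the shape of the problem without digging through the OMB source, here is a minimal sketch of the communication pattern described above. It is not the actual osu_bibw code, and the buffer names, tag and constants are just illustrative: both ranks post 64 concurrent sends from one 4MB buffer and 64 concurrent receives into another (which, as noted above, is not strictly legal MPI for the receives, but it mirrors what the benchmark does).

    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW   64                  /* OMB default window size */
    #define MSG_SIZE (4 * 1024 * 1024)   /* the 4MB step where the hang appears */

    int main(int argc, char **argv)
    {
        int rank, peer, i;
        MPI_Request sreq[WINDOW], rreq[WINDOW];
        char *sbuf = malloc(MSG_SIZE);
        char *rbuf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                 /* run with exactly two ranks */

        /* 64 concurrent sends, all from the same buffer ... */
        for (i = 0; i < WINDOW; i++)
            MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 100,
                      MPI_COMM_WORLD, &sreq[i]);

        /* ... then 64 concurrent receives, all into the same buffer
           (not legal per the MPI standard, but what the benchmark does) */
        for (i = 0; i < WINDOW; i++)
            MPI_Irecv(rbuf, MSG_SIZE, MPI_CHAR, peer, 100,
                      MPI_COMM_WORLD, &rreq[i]);

        MPI_Waitall(WINDOW, sreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(WINDOW, rreq, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        free(sbuf);
        free(rbuf);
        return 0;
    }

Under a 128MB memlock limit this is the shape that, per the analysis above, stalls at the 4MB step over the OpenIB BTL, and running it with --mca mpi_leave_pinned 1 is the workaround described above.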