We are using Open MPI on several 1000+ node clusters. We recently received several new clusters running the Infiniserve 3.X software stack and are having a number of problems with the vapi btl (yes, I know it is very old and shouldn't be used; I couldn't agree more, but those are my marching orders).
I have a new application that is running into swap for an unknown reason. If I force it to use the tcp btl, it doesn't seem to run into swap (the job just takes a very long time). I have tried restricting the size of the free lists, forcing send mode, and using an Open MPI build compiled without a memory manager, but nothing seems to help. I've profiled with valgrind --tool=massif and the memtrace capabilities of ptmalloc, but I don't have any smoking guns yet. It is a Fortran app and I don't know anything about debugging Fortran memory problems. Can someone point me in the right direction?

Thanks,
Josh