On May 31, 2007, at 7:25 PM, Ralph Campbell wrote:
I can run the Intel MPI benchmarks OK at np=2 but at np=4, it hangs.
Bummer.
If I set use_eager_rdma = 0 in the [QLogic InfiniPath] section of /usr/share/openmpi/mca-btl-openib-hca-params.ini...
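[For readers following along: entries in mca-btl-openib-hca-params.ini are plain INI sections keyed by HCA name; a rough sketch of the change Ralph describes (the vendor_id value below is a placeholder, not necessarily the real QLogic ID -- check the shipped file):

```ini
; Sketch of a section in /usr/share/openmpi/mca-btl-openib-hca-params.ini
[QLogic InfiniPath]
vendor_id = 0x1fc1          ; placeholder -- the shipped file has the real IDs
use_eager_rdma = 0          ; disable the eager-RDMA protocol for this HCA
```
]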
FYI, you can change such values on the command line and/or in the environment -- see http://www.open-mpi.org/faq/?category=tuning#setting-mca-params. The MCA parameter in question is btl_openib_use_eager_rdma.
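[Concretely, both mechanisms from the FAQ look like this; IMB-MPI1 is just an example binary name, and the mpirun invocations are shown as comments since they need a working cluster:

```shell
# On the mpirun command line:
#   mpirun --mca btl_openib_use_eager_rdma 0 -np 4 ./IMB-MPI1
#
# Or in the environment: Open MPI picks up any variable named
# OMPI_MCA_<param_name> at launch time.
export OMPI_MCA_btl_openib_use_eager_rdma=0
#   mpirun -np 4 ./IMB-MPI1    # would now run with eager RDMA disabled
```
]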
Then it gets much farther before hanging on 2 MB+ messages. If I create .openmpi/mca-params.conf with min_rdma_size = 2147483648, the benchmark completes reliably.
Yoinks. I assume you mean btl_openib_min_rdma_size, right? (note that the name slightly changed for the upcoming 1.3 [i.e., the SVN trunk]; although the old name is deprecated, it'll still work)
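[For reference, per-user MCA defaults go in a plain "name = value" file; a sketch of the workaround Ralph describes, using the full parameter name Jeff gives (2147483648 bytes = 2 GB, which effectively disables the RDMA pipeline protocol):

```
# $HOME/.openmpi/mca-params.conf -- per-user MCA parameter defaults,
# one "name = value" pair per line
btl_openib_use_eager_rdma = 0
btl_openib_min_rdma_size = 2147483648
```
]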
When the hang happens, the ipath driver thinks all the posted work requests have completed and all the completion entries have been generated, but Open MPI seems to think they haven't all completed. Can someone point me to the code where RDMA write completions are polled on the destination node?
All the OFA code in OMPI is in ompi/mca/btl/openib (i.e., the "openib" BTL plugin).
The completion polling occurs in btl_openib_component.c, in two main functions: btl_openib_component_progress() and btl_openib_module_progress(). The component progress function mainly checks for eager RDMA progress; if there is none (per your setting use_eager_rdma to 0), it falls through to the module progress function. There's one module "instance" for each HCA port, so we basically loop over the modules, checking each one (i.e., each port).
Galen tells me that it may be a little more subtle than this, such as an ordering issue -- he's going to reply with more detail shortly.
--
Jeff Squyres
Cisco Systems