mvapich 1. (0.9.9, 1.0.1, 1.1.0, depending on the OFED version, etc)

Thanks,
Rick

On Tuesday 28 October 2008, Pavel Shamis (Pasha) wrote:
> Which MPI implementation do you use ?
>
> Rick Warner wrote:
> > On Monday 27 October 2008, Rick Warner wrote:
> >> Hi all,
> >>
> >> I am configuring an opteron cluster with connectX Infiniband. I have a
> >> problem that if I run one of the NAS tests, it works the first, and
> >> maybe 2nd time, but after that the jobs instantly fail with messages
> >> like this-
> >>
> >> [Rank 44][cm.c: line 860]poll CQ failed -2
> >> [Rank 51][cm.c: line 860]poll CQ failed -2
> >> [Rank 119][cm.c: line 860]poll CQ failed -2
> >> [Rank 85][cm.c: line 860]poll CQ failed -2
> >> [Rank 0][cm.c: line 860]poll CQ failed -2
> >> [Rank 9][cm.c: line 860]poll CQ failed -2
> >> [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
> >> poll CQ failed -2
> >> [Rank 94][cm.c: line 860]poll CQ failed -2
> >> [Rank 111][cm.c: line 860]poll CQ failed -2
> >>
> >> I can easily reproduce this with only 2 systems using a 16 process LU
> >> job, class B.
> >>
> >> Here are the configs I've tried-
> >> Suse 11 with distro provided IB driver and libraries,etc, using mvapich
> >> as provided by ohio state
> >> Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich
> >> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> >>
> >> They all have the same basic problem. I think one of them reported
> >> "Error polling CQ" instead of "poll CQ failed".
> >>
> >> If I replace the connectX cards with regular DDR cards the problem goes
> >> away.
> >>
> >> I'm getting quite stumped at this point and would appreciate any
> >> suggestions or patches.
> >>
> >> Thanks,
> >> Rick
> >
> > I forgot to mention- on Suse 11 I also tried a manually compiled 2.6.26.4
> > and 2.6.27.2 kernel, using the in kernel drivers.
> >
> > Thanks,
> > Rick
>
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
