Hi Eli,

Thanks for the suggestion. Unfortunately, I have now reproduced the same problem on a group of 8 Xeon-based systems as well, so the problem is not specific to the Opterons.

Thanks,
Rick

On Tuesday 28 October 2008, Eli Cohen wrote:
> On Mon, Oct 27, 2008 at 06:38:48PM -0400, Rick Warner wrote:
> > Hi all,
> >
> > I am configuring an opteron cluster with connectX Infiniband. I have a
> > problem that if I run one of the NAS tests, it works the first, and maybe
> > 2nd time, but after that the jobs instantly fail with messages like this-
> >
> > [Rank 44][cm.c: line 860]poll CQ failed -2
> > [Rank 51][cm.c: line 860]poll CQ failed -2
> > [Rank 119][cm.c: line 860]poll CQ failed -2
> > [Rank 85][cm.c: line 860]poll CQ failed -2
> > [Rank 0][cm.c: line 860]poll CQ failed -2
> > [Rank 9][cm.c: line 860]poll CQ failed -2
> > [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
> > poll CQ failed -2
> > [Rank 94][cm.c: line 860]poll CQ failed -2
> > [Rank 111][cm.c: line 860]poll CQ failed -2
>
> This error means that a CQE was polled which belongs to a none
> existent QP. But, I do remember a case with an Opteron which
> experienced the same problem and eventually it appeared that it was a
> system problem that was resolved after a BIOS update. Can you check if
> there is an update to your system's BIOS?
>
> > I can easily reproduce this with only 2 systems using a 16 process LU
> > job, class B.
> >
> > Here are the configs I've tried-
> > Suse 11 with distro provided IB driver and libraries,etc, using mvapich
> > as provided by ohio state
> > Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich
> > Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> >
> > They all have the same basic problem. I think one of them reported
> > "Error polling CQ" instead of "poll CQ failed".
> >
> > If I replace the connectX cards with regular DDR cards the problem goes
> > away.
> >
> > I'm getting quite stumped at this point and would appreciate any
> > suggestions or patches.
> >
> > Thanks,
> > Rick
> > --
> > Richard Warner
> > Lead Systems Integrator
> > Microway, Inc
> > (508)732-5517
> > _______________________________________________
> > general mailing list
> > [email protected]
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general

--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
