Which MPI implementation do you use?

Rick Warner wrote:
On Monday 27 October 2008, Rick Warner wrote:
Hi all,

I am configuring an Opteron cluster with ConnectX InfiniBand.  I have a
problem: if I run one of the NAS tests, it works the first and maybe the
second time, but after that the jobs instantly fail with messages like this:

[Rank 44][cm.c: line 860]poll CQ failed -2
[Rank 51][cm.c: line 860]poll CQ failed -2
[Rank 119][cm.c: line 860]poll CQ failed -2
[Rank 85][cm.c: line 860]poll CQ failed -2
[Rank 0][cm.c: line 860]poll CQ failed -2
[Rank 9][cm.c: line 860]poll CQ failed -2
[Rank 26][cm.c: line 860]poll CQ failed -2
[Rank 43][cm.c: line 860]poll CQ failed -2
[Rank 94][cm.c: line 860]poll CQ failed -2
[Rank 111][cm.c: line 860]poll CQ failed -2

I can easily reproduce this with only 2 systems using a 16-process LU job,
class B.

Here are the configs I've tried:
Suse 11 with the distro-provided IB driver, libraries, etc., using MVAPICH as
provided by Ohio State
Suse 11 with the distro driver, using OFED 1.3.1 libraries and MVAPICH
Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3

They all have the same basic problem.  I think one of them reported "Error
polling CQ" instead of "poll CQ failed".

If I replace the ConnectX cards with regular DDR cards, the problem goes
away.

I'm getting quite stumped at this point and would appreciate any
suggestions or patches.

Thanks,
Rick

I forgot to mention: on Suse 11 I also tried manually compiled 2.6.26.4 and 2.6.27.2 kernels, using the in-kernel drivers.

Thanks,
Rick


_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
