Hi,
I'm new to the group and am not sure if this is the right place for my question.

After I installed OFED on my cluster, I was able to use InfiniBand on my system. Since then the system has been moved, and all the InfiniBand cables, nodes, and the switch were disconnected and reconnected. I can no longer start my parallel computing jobs. Whenever I run the ibhosts command, I get output like this:

>ibhosts
src/query_smp.c:192; umad (DR path slid 0; dlid 0; 0,1,2 Attr 0x11:0) bad status 110; Connection timed out
Ca : 0x0011750000ff585f ports 1 "compute-0-9 HCA-1"
Ca : 0x0011750000ff5815 ports 1 "compute-0-8 HCA-1"
Ca : 0x0011750000ff5860 ports 1 "compute-0-7 HCA-1"
Ca : 0x0011750000ff588e ports 1 "compute-0-6 HCA-1"
Ca : 0x0011750000ff5821 ports 1 "compute-0-5 HCA-1"
Ca : 0x0011750000ff57f3 ports 1 "compute-0-4 HCA-1"
Ca : 0x0011750000ff58a3 ports 1 "compute-0-3 HCA-1"
Ca : 0x0011750000ff579a ports 1 "compute-0-2 HCA-1"
Ca : 0x0011750000ff57d0 ports 1 "compute-0-1 HCA-1"
Ca : 0x0011750000ff58a2 ports 1 "kratos HCA-1"

Where can I get more helpful debug information for this kind of issue? I ran dmesg but didn't seem to get any related messages:

>dmesg
ib_qib 0000:07:00.0: IB0:1 got a lid: 0x1
ib_mad: Method 1 already in use

Thanks.
Bright Yang
