Hi,

 

I'm new to the group and am not sure if this is the right place for my
questions.

 

After I installed OFED on my cluster, I was able to use InfiniBand on my
system. Since then the system has been moved, and all the InfiniBand
cables, nodes, and the switch were disconnected and reconnected. I can no
longer start my parallel computing jobs. Whenever I run the ibhosts
command, I get a message like this:

 

>ibhosts
src/query_smp.c:192; umad (DR path slid 0; dlid 0; 0,1,2 Attr 0x11:0)
bad status 110; Connection timed out
Ca      : 0x0011750000ff585f ports 1 "compute-0-9 HCA-1"
Ca      : 0x0011750000ff5815 ports 1 "compute-0-8 HCA-1"
Ca      : 0x0011750000ff5860 ports 1 "compute-0-7 HCA-1"
Ca      : 0x0011750000ff588e ports 1 "compute-0-6 HCA-1"
Ca      : 0x0011750000ff5821 ports 1 "compute-0-5 HCA-1"
Ca      : 0x0011750000ff57f3 ports 1 "compute-0-4 HCA-1"
Ca      : 0x0011750000ff58a3 ports 1 "compute-0-3 HCA-1"
Ca      : 0x0011750000ff579a ports 1 "compute-0-2 HCA-1"
Ca      : 0x0011750000ff57d0 ports 1 "compute-0-1 HCA-1"
Ca      : 0x0011750000ff58a2 ports 1 "kratos HCA-1"

 

Where can I get more helpful debug information for this kind of issue?
I ran dmesg, but it did not seem to show any related messages:

>dmesg
ib_qib 0000:07:00.0: IB0:1 got a lid: 0x1
ib_mad: Method 1 already in use

 

Thanks.

Bright Yang

