Sayantan Sur wrote:

Hello,

I have uDAPL over Gen2 set up on our cluster and am able to run uDAPL
programs. However, I sometimes get this error (after a few runs of the
same program):

open_hca: ERR ib_at_ips_by_gid for mthca0
dapls_ib_open_hca failed 40000

uDAPL uses uAT to get the IP address using the GID (ATS records via SA) of the local device/port. The SA query for this record is failing for some reason. Did your SM bounce during this time? Did you bounce or reconfigure the IPoIB network device?

You can set "env DAPL_DBG_TYPE=0xffff" for more information.

-arlin
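
A minimal sketch of the debug suggestion above, assuming a POSIX shell. The wrapper name run_with_dapl_debug and the program name in the usage comment are placeholders, not part of uDAPL; only the DAPL_DBG_TYPE=0xffff setting comes from the reply.

```shell
# Run any command with the uDAPL debug mask from the reply set in its environment.
# The function name is hypothetical; DAPL_DBG_TYPE=0xffff is the value suggested above.
run_with_dapl_debug() {
    env DAPL_DBG_TYPE=0xffff "$@"
}

# Usage (placeholder program name); stdout and stderr are both captured
# because it is not assumed here which stream the debug output goes to:
#   run_with_dapl_debug ./my_udapl_test 2>&1 | tee dapl_debug.log
```

Setting the variable only for the wrapped command keeps the extra logging out of other jobs started from the same shell.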

The machine is an AMD Opteron (Tyan S2895), with Mellanox MemFree cards
(fw ver 5.1.0).

lsmod on my machine shows this:

[EMAIL PROTECTED]:~] lsmod | grep ^ib
ib_ipoib               48008  0
ib_uat                 14840  0
ib_at                  25696  1 ib_uat
ib_sa                  17804  2 ib_ipoib,ib_at
ib_ucm                 22280  0
ib_cm                  37744  1 ib_ucm
ib_uverbs              35992  0
ib_umad                18208  0
ib_mthca              122656  0
ib_mad                 44072  4 ib_sa,ib_cm,ib_umad,ib_mthca
ib_core                56192  8 ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad

My infiniband devices are (created by hand):

[EMAIL PROTECTED]:~] ls -l /dev/infiniband/
total 0
crw-rw-rw-  1 root root 231, 191 2005-10-20 21:13 uat
crw-rw-rw-  1 root root 231, 224 2005-10-20 21:12 ucm0
crwxrwxrwx  1 root root 231, 192 2005-09-21 04:37 umad0
crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs0
crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs1


I'd really appreciate it if someone could help me understand what might be
going wrong.

Thanks,
Sayantan.

