Hi all,
I have "inherited" a small cluster with a head node and four compute
nodes that I have to administer. The nodes are connected via
InfiniBand (OFED). When I run

cexec :1-4 ibstatus

I get some information indicating that InfiniBand is at least nominally available:

************************* oscar_cluster *************************
--------- n01---------
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0025:930d
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

--------- n02---------
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0025:931d
        base lid:        0x3
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

--------- n03---------
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0025:9321
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

--------- n04---------
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0025:9201
        base lid:        0x2
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

However, when I start running an MPI job I get the following messages
indicating that InfiniBand is not being used (I am definitely using the
MPI libraries compiled with InfiniBand support):

[0,1,0]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,3]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
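
From the searching I have done so far: uDAPL apparently discovers
adapters through the DAT registry file, typically /etc/dat.conf on
OFED, so a missing or wrong entry there might explain "unable to find
any NICs" even though ibstatus reports the ports as ACTIVE. On the
nodes I would expect to find something like the following line (the
provider name, library version, and "ib0" interface are my assumptions
from a generic OFED 1.x install, not checked on this cluster):

    # /etc/dat.conf -- one line per DAT provider; the "ib0 0" part
    # must name the IPoIB interface actually present on the node
    OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""

Is checking this file the right direction?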

I am a complete novice in the InfiniBand area, so can anybody give me
some advice on what is going wrong here and how to get the jobs
running over InfiniBand?
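
In case it is relevant: the fallback message looks like Open MPI's, so
I assume I could force the InfiniBand BTL to turn the silent fallback
into a hard error (the option names are my assumption based on Open
MPI's MCA parameter system, and my_mpi_app is just a placeholder):

    # Restrict Open MPI to the openib BTL (plus self and shared
    # memory); if InfiniBand is really unusable, this should fail
    # loudly instead of silently falling back to TCP
    mpirun --mca btl openib,self,sm -np 4 ./my_mpi_app

But I would rather understand why uDAPL cannot find the HCA in the
first place.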


Thanks for any help

Michael


_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
