I have a case where ib_cm_open_device() is failing for an odd reason. I have 12 servers that contain both HCAs and iWARP NICs. In most cases, everything is fine. But on one of these servers, IBCM refuses to work -- ib_cm_open_device() fails with the following:

libibcm: unable to open /dev/infiniband/ucm1

Looking closer, this device does, indeed, exist:

[4:01] svbu-mpi044:~/mpi % ls -l /dev/infiniband/ucm*
crw-rw-rw-  1 root root 231, 224 Jul 16 04:30 /dev/infiniband/ucm0
crw-rw-rw-  1 root root 231, 225 Jul 16 04:30 /dev/infiniband/ucm1
[4:08] svbu-mpi044:~/mpi %

Granted, I had to create these devices manually because they are not created automatically for me upon boot in RHEL4U4 and U6. The same device major/minor numbers work fine for me on all my other servers.

So what's different between the 11 machines that work and the 1 that doesn't? It seems that the kernel ordering of devices is what is different. On most of the machines:

[4:10] svbu-mpi045:~ % ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.3.000
        node_guid:                      0002:c903:0000:036c
        sys_image_guid:                 0002:c903:0000:036f
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       MT_04A0110002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               7
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               52
                        port_lmc:               0x00

hca_id: nes0
        node_guid:                      0012:5502:63c0:0000
        sys_image_guid:                 0012:5502:63c0:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x5
        board_id:                       NES020 Board ID
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               1
                        port_lmc:               0x00

But on this one problematic machine:

[4:10] svbu-mpi044:~/mpi % ibv_devinfo
hca_id: nes0
        node_guid:                      0012:5502:63b8:0000
        sys_image_guid:                 0012:5502:63b8:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x5
        board_id:                       NES020 Board ID
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               1
                        port_lmc:               0x00

hca_id: mlx4_0
        fw_ver:                         2.3.000
        node_guid:                      0002:c903:0000:03b0
        sys_image_guid:                 0002:c903:0000:03b3
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       MT_04A0110002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               6
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               136
                        port_lmc:               0x00

Notice that the ordering is different.
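
For comparing the ordering across all 12 machines without eyeballing the full output, a small sketch like the following could pull out just the hca_id lines in the order ibv_devinfo reports them. The function name hca_order is made up for illustration, and the parsing assumes the "hca_id: <name>" line format shown in the output above.

```python
import re


def hca_order(devinfo_text):
    """Return the device names from 'hca_id:' lines, in the order they appear."""
    return re.findall(r"^hca_id:\s*(\S+)", devinfo_text, flags=re.MULTILINE)


if __name__ == "__main__":
    # Abbreviated sample in the same shape as the ibv_devinfo output above.
    sample = (
        "hca_id: nes0\n"
        "        hw_ver:                         0x5\n"
        "hca_id: mlx4_0\n"
        "        fw_ver:                         2.3.000\n"
    )
    print(hca_order(sample))  # prints ['nes0', 'mlx4_0']
```

Feeding it the captured ibv_devinfo output from each host would give one short list per machine, which makes the odd one out easy to spot.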

So I'm not enough of a kernel guy to know where the problem is:

1. Technically, mlx4_0 is the first IB device. Should it therefore be using ucm0? I.e., is libibcm wrong for trying to use ucm1? (Note that OMPI's openib BTL currently replicates libibcm's logic for finding the Right ucm* file, so that we can fail silently before ib_cm_open_device() fails with a warning message. If libibcm's logic for finding the Right ucm* file changes, we'll also need to change OMPI's logic to mirror it. OMPI's logic becomes moot in newer libibcm versions, though, where Sean removed the warning message.)

2. Or are my major/minor numbers incorrect for the devices that I created manually? If the major/minor device numbers were created by the OS upon bootup (as they should be -- there's an open OpenFabrics bugzilla ticket about this), would they be correct?
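
One way to gather data for both questions is sketched below. It rests on assumptions that should be verified on the actual kernel: that the ib_ucm module exposes each device under /sys/class/infiniband_cm/ucmN, with an "ibdev" attribute naming the owning IB device and a "dev" attribute holding "major:minor". The helper names (read_attr, map_ucm_devices, node_numbers) are hypothetical. The script prints, for each ucmN the kernel knows about, which device it belongs to and whether the manually created /dev/infiniband/ucmN node carries the same numbers.

```python
import os
import re
import stat


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def map_ucm_devices(sysfs_root="/sys/class/infiniband_cm"):
    """Map each ucmN entry to (ibdev name, 'major:minor' string) from sysfs.

    The sysfs layout is an assumption; missing attributes come back as None.
    """
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for name in sorted(os.listdir(sysfs_root)):
        if not re.fullmatch(r"ucm\d+", name):
            continue  # skip non-device entries such as abi_version
        entry = os.path.join(sysfs_root, name)
        result[name] = (
            read_attr(os.path.join(entry, "ibdev")),
            read_attr(os.path.join(entry, "dev")),
        )
    return result


def node_numbers(dev_path):
    """Return 'major:minor' for a character device node, or None if absent."""
    try:
        st = os.stat(dev_path)
    except OSError:
        return None
    if not stat.S_ISCHR(st.st_mode):
        return None
    return f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"


if __name__ == "__main__":
    for name, (ibdev, sysfs_dev) in map_ucm_devices().items():
        node_dev = node_numbers(os.path.join("/dev/infiniband", name))
        verdict = "OK" if node_dev == sysfs_dev else "MISMATCH"
        print(f"{name}: ibdev={ibdev} sysfs={sysfs_dev} node={node_dev} {verdict}")
```

If the kernel reports no entry for ucm1 at all, or reports numbers that differ from 231:224 and 231:225, that would point at the manually created nodes rather than at libibcm's device lookup.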

--
Jeff Squyres
Cisco Systems
