On 2/3/2013 2:11 AM, Or Gerlitz wrote:
On 03/02/2013 07:14, Bharath Ramesh wrote:
Intermittently, a couple of nodes in our cluster throw the error "Failed to obtain HW semaphore, aborting" on boot. When this error occurs we are unable to use IB on those nodes; unloading and reloading the module doesn't help.

Load mlx4_core with debug_level=1 and send the resulting dmesg output along with the lspci info for the card ("$ lspci | grep Mellanox").
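
The request above could be carried out roughly as follows. This is only a sketch: the function names and the output file name are hypothetical, the reload step needs root, and it assumes any modules that depend on mlx4_core (for example mlx4_ib or mlx4_en) have already been unloaded. Only the debug_level=1 parameter and the lspci command come from the thread.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not part of OFED.

# Reload mlx4_core with debug tracing enabled (run as root).
# Assumes dependent modules such as mlx4_ib / mlx4_en are already unloaded.
reload_mlx4_debug() {
    modprobe -r mlx4_core
    modprobe mlx4_core debug_level=1
}

# Keep only the kernel log lines relevant to the probe failure.
# Reads a log on stdin, so it works on live dmesg output or a saved file.
mlx4_log_lines() {
    grep -E 'mlx4_core|NMI received' || true
}

# Typical use (not executed here):
#   reload_mlx4_debug
#   dmesg | mlx4_log_lines > mlx4-debug.txt
#   lspci | grep Mellanox >> mlx4-debug.txt
```

Filtering on stdin keeps the log step separate from the privileged reload, so the same filter can be applied to a dmesg dump saved from a node that failed at boot.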
The same node comes up fine on some reboots and fails with this error on others. Here is the output from lspci:
$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

dmesg output from loading mlx4_core with debug_level=1:

mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.4 (November 10, 2011)
mlx4_core: Initializing 0000:01:00.0
mlx4_core 0000:01:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mlx4_core 0000:01:00.0: setting latency timer to 64
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
mlx4_core 0000:01:00.0: Failed to reset HCA, aborting.
mlx4_core 0000:01:00.0: PCI INT A disabled
mlx4_core: probe of 0000:01:00.0 failed with error -11

I am unable to run ibv_devinfo on the bad node. Here is the output from a good node:
$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.10.2370
        node_guid:                      001e:6703:003c:dff4
        sys_image_guid:                 001e:6703:003c:dff7
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       INCX-3I358C10501
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 358
                        port_lid:               331
                        port_lmc:               0x00
                        link_layer:             IB

--
Bharath

