Pawel Dziekonski wrote:
Hi,
From time to time I get catastrophic errors like the ones below. The software stack is
kernel 2.6.18-92.1.10.el5 with the Lustre client. Device and OFED info is also
below.
Any hints?
Thanks in advance, Pawel
06:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
# ibv_devices
    device              node GUID
    ------              ----------------
    mthca0              0030487e07700000
# ibv_devinfo
hca_id: mthca0
    fw_ver:             1.2.0
    node_guid:          0030:487e:0770:0000
    sys_image_guid:     0030:487e:0770:0003
    vendor_id:          0x02c9
    vendor_part_id:     25204
    hw_ver:             0xA0
    board_id:           SM_0000000003
    phys_port_cnt:      1
        port: 1
            state:      PORT_ACTIVE (4)
            max_mtu:    2048 (4)
            active_mtu: 2048 (4)
            sm_lid:     1
            port_lid:   441
            port_lmc:   0x00
kernel: ib_mthca 0000:06:00.0: Catastrophic error detected: unknown error
kernel: ib_mthca 0000:06:00.0: buf[00]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[01]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[02]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[03]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[04]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[05]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[06]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[07]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[08]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[09]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0a]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0b]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0c]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0d]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0e]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0f]: ffffffff
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: Failed to modify QP to ERROR state
kernel: ib0: timing out; 0 sends 128 receives not completed
kernel: ib0: Failed to modify QP to RESET state
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_CQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_SRQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:01:00.0: Catastrophic error detected: internal parity error
kernel: ib_mthca 0000:01:00.0: buf[00]: 05000000
kernel: ib_mthca 0000:01:00.0: buf[01]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[02]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[03]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[04]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[05]: 00127f2c
kernel: ib_mthca 0000:01:00.0: buf[06]: 000a0056
kernel: ib_mthca 0000:01:00.0: buf[07]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[08]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[09]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0a]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0b]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0c]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0d]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0e]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0f]: 00000000
kernel: ib0: ib_query_port failed
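When triaging dumps like these across many nodes, it can help to group them per PCI device. A minimal Python sketch, assuming the log lines look exactly like the ones quoted above and that wrapped lines have been rejoined; the function name `summarize` and the `all_ones` flag are illustrative, not part of any driver or OFED tool:

```python
import re
from collections import defaultdict

# Patterns modelled on the ib_mthca messages quoted in this thread.
CATAS = re.compile(r"ib_mthca (\S+): Catastrophic error detected: (.*)")
BUF = re.compile(r"ib_mthca (\S+): buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})")

def summarize(log_text):
    """Map PCI address -> reported error type and dumped buffer words."""
    out = defaultdict(lambda: {"error": None, "buf": []})
    for line in log_text.splitlines():
        m = CATAS.search(line)
        if m:
            out[m.group(1)]["error"] = m.group(2).strip()
            continue
        m = BUF.search(line)
        if m:
            out[m.group(1)]["buf"].append(m.group(3))
    # Flag dumps where every word is ffffffff; this may mean the
    # error buffer could not be read at all (an assumption, not
    # something stated in this thread).
    for info in out.values():
        info["all_ones"] = bool(info["buf"]) and all(
            w == "ffffffff" for w in info["buf"]
        )
    return dict(out)

sample = """kernel: ib_mthca 0000:06:00.0: Catastrophic error detected: unknown error
kernel: ib_mthca 0000:06:00.0: buf[00]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[01]: ffffffff
kernel: ib_mthca 0000:01:00.0: Catastrophic error detected: internal parity error
kernel: ib_mthca 0000:01:00.0: buf[00]: 05000000
"""

for dev, info in summarize(sample).items():
    print(dev, info["error"], info["all_ones"], info["buf"])
```

Note that the two dumps above come from different PCI addresses (06:00.0 and 01:00.0), so grouping by address keeps them apart.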
This is a known issue with InfiniHost III HCA FW 1.2.0.
Please contact Mellanox support to get an updated firmware version.
Tziporet
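
To find which nodes still run the affected firmware, the `fw_ver` field of `ibv_devinfo` can be checked with a short script. A minimal Python sketch, assuming the output format shown in the original post; the function name `find_affected_hcas` and the hard-coded version string are illustrative choices, not part of any Mellanox or OFED tool:

```python
import re

AFFECTED_FW = "1.2.0"  # version named in this thread; adjust as needed

def find_affected_hcas(devinfo_text, affected=AFFECTED_FW):
    """Return the hca_id values whose fw_ver equals `affected`."""
    hits = []
    hca = None
    for line in devinfo_text.splitlines():
        m = re.match(r"\s*hca_id:\s*(\S+)", line)
        if m:
            hca = m.group(1)  # remember which HCA the next fields belong to
            continue
        m = re.match(r"\s*fw_ver:\s*(\S+)", line)
        if m and hca is not None and m.group(1) == affected:
            hits.append(hca)
    return hits

sample = """hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0030:487e:0770:0000
"""

print(find_affected_hcas(sample))
```

On a live node the text could come from `subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout` instead of the hard-coded sample.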