Hello all,

We recently added InfiniBand to our cluster and are in the process of testing it with Lustre. We're running the distro-provided drivers for the Mellanox cards with the latest firmware. Overnight we started seeing the following errors on a few OSSes:

infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2

I found a post suggesting this might be IOMMU related, but disabling the IOMMU doesn't seem to help. We're running Lustre 2.15, more or less at the tip of b2_15 (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8).

Has anyone seen this before, or have any pointers?

Thanks,
Shane
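
P.S. In case it helps anyone reading the dumps: below is a rough sketch of how the last 16 bytes might be decoded. It assumes the 64-byte error CQE layout of struct mlx5_err_cqe in the upstream Linux kernel headers (vendor syndrome at byte 54, syndrome at byte 55, then WQE opcode/QPN, WQE counter, signature, and op_own). The offsets, the syndrome table, and the helper name decode_err_cqe are assumptions taken from my reading of that header, not something confirmed for this driver build.

```python
import struct

# Syndrome names as I understand them from the upstream mlx5 headers.
# ASSUMPTION: this table may not match every driver version.
SYNDROMES = {
    0x01: "LOCAL_LENGTH_ERR",
    0x02: "LOCAL_QP_OP_ERR",
    0x04: "LOCAL_PROT_ERR",
    0x05: "WR_FLUSH_ERR",
    0x06: "MW_BIND_ERR",
    0x10: "BAD_RESP_ERR",
    0x11: "LOCAL_ACCESS_ERR",
    0x12: "REMOTE_INVAL_REQ_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x14: "REMOTE_OP_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
    0x22: "REMOTE_ABORTED_ERR",
}


def decode_err_cqe(hex_dump: str) -> dict:
    """Decode a 64-byte error CQE given as space-separated hex bytes.

    Hypothetical helper: the field offsets follow the assumed
    struct mlx5_err_cqe layout (big-endian, fields start at byte 54).
    """
    raw = bytes.fromhex(hex_dump)
    if len(raw) != 64:
        raise ValueError(f"expected 64 bytes, got {len(raw)}")
    # u8 vendor_err_synd, u8 syndrome, be32 s_wqe_opcode_qpn,
    # be16 wqe_counter, u8 signature, u8 op_own
    vendor, syndrome, opcode_qpn, wqe_counter, signature, op_own = (
        struct.unpack_from(">BBIHBB", raw, 54)
    )
    return {
        "vendor_err_synd": vendor,
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "UNKNOWN"),
        "wqe_opcode": opcode_qpn >> 24,      # top byte of the 32-bit word
        "qpn": opcode_qpn & 0xFFFFFF,        # low 24 bits
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,           # high nibble of op_own
    }


# First dump from the log: three all-zero rows, then the last row.
first_dump = "00 " * 48 + "00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2"
for key, value in decode_err_cqe(first_dump).items():
    print(key, hex(value) if isinstance(value, int) else value)
```

Under those assumptions, the first dump decodes to syndrome 0x13 with vendor syndrome 0x88 on QPN 0xa0, which the table above maps to a remote access error. The other two dumps differ only in QPN, WQE counter, and signature.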
