I had a similar issue; it was apparently not a Lustre issue for us. In addition to the entries you see below, we also saw "AMD-Vi: Event ... IO_PAGE_FAULT" in the logs.
Setting iommu=pt helped us.

Hope that helps.

Thank you,
Amit

-----Original Message-----
From: lustre-discuss <[email protected]> On Behalf Of Nehring, Shane R [LAS] via lustre-discuss
Sent: Thursday, May 18, 2023 10:06 AM
To: [email protected]
Subject: [lustre-discuss] mlx5 errors on oss

Hello all,

We recently added infiniband to our cluster and are in the process of testing it with lustre. We're running the distro-provided drivers for the mellanox cards with the latest firmware. Overnight we started seeing the following errors on a few oss:

infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2

I found a post suggesting this might be iommu related, but disabling the iommu doesn't seem to help any.

We're running lustre 2.15, more or less at the tip of b2_15 (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8).

Has anyone seen this before or have any pointers?

Thanks
Shane

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
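
For anyone comparing notes on the reply above: iommu=pt is a kernel boot parameter, so a quick way to confirm whether a node is actually running with it is to look at /proc/cmdline, and the two symptoms mentioned in this thread ("dump error cqe" and "IO_PAGE_FAULT") can be counted in a saved kernel log. The following is only a minimal sketch of that check. The function names iommu_setting and count_symptoms are made up for illustration, and how the parameter gets added to the boot configuration depends on the distro and bootloader, so that part is not shown.

```python
import re
from pathlib import Path


def iommu_setting(cmdline: str):
    """Return the value of the last iommu= token on a kernel command line, or None.

    Tokens such as amd_iommu= or intel_iommu= are deliberately not matched.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value


def count_symptoms(log_text: str):
    """Count the two log signatures mentioned in this thread (hypothetical helper)."""
    return {
        "dump_error_cqe": len(re.findall(r"dump error cqe", log_text)),
        "io_page_fault": len(re.findall(r"IO_PAGE_FAULT", log_text)),
    }


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        print("iommu =", iommu_setting(cmdline_path.read_text()))
```

count_symptoms takes log text as a string, so it can be fed saved dmesg or journal output from each OSS to see which nodes show page faults alongside the CQE dumps.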
