That was helpful, thank you. In our case it looks like the cause was a client that was having issues with the IOMMU; we were seeing identical AMD-Vi errors on that client. Once that client was rebooted, the errors stopped on the servers. I've set iommu=off since we don't actually need it enabled on these nodes.
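For anyone wanting to check other nodes for the same symptom, here is a minimal Python sketch, not a tested tool. It assumes the kernel log contains lines shaped like the two messages quoted in this thread (the mlx5 "dump error cqe" line and the "AMD-Vi: Event ... IO_PAGE_FAULT" line). The function names `iommu_setting` and `count_events` are hypothetical helpers made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this thread; the exact
# wording of AMD-Vi events on a given kernel is an assumption.
AMD_VI = re.compile(r"AMD-Vi: Event.*IO_PAGE_FAULT")
MLX5_CQE = re.compile(r"mlx5_\d+: dump_cqe:\d+:\(pid \d+\): dump error cqe")


def iommu_setting(cmdline):
    """Return the value of the last iommu= token on a kernel command line.

    Returns None if no iommu= token is present. Vendor-specific
    parameters such as amd_iommu= are deliberately not considered.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value


def count_events(log_lines):
    """Count AMD-Vi page-fault events and mlx5 error CQE dumps."""
    counts = Counter()
    for line in log_lines:
        if AMD_VI.search(line):
            counts["amd_vi_io_page_fault"] += 1
        if MLX5_CQE.search(line):
            counts["mlx5_error_cqe"] += 1
    return counts
```

On a node, `iommu_setting` could be fed the contents of /proc/cmdline and `count_events` the lines of the kernel log. Running it on both clients and servers matters here, since in this case the IOMMU faults were on a client while the error CQEs showed up on the servers.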
On Thu, 2023-05-18 at 16:19 +0000, Kumar, Amit wrote:
> I had similar issue; it was apparently not a lustre issue for us. In addition
> to the entries, you see below we also saw "AMD-Vi: Event ... IO_PAGE_FAULT "
> in the logs.
>
> Setting iommu=pt helped us.
>
> Hope that helps.
>
> Thank you,
> Amit
>
> -----Original Message-----
> From: lustre-discuss <[email protected]> On Behalf Of
> Nehring, Shane R [LAS] via lustre-discuss
> Sent: Thursday, May 18, 2023 10:06 AM
> To: [email protected]
> Subject: [lustre-discuss] mlx5 errors on oss
>
> Hello all,
>
> We recently added infiniband to our cluster and are in the process of testing
> it with lustre. We're running the distro provided drivers for the mellanox
> cards with the latest firmware. Overnight we started seeing the following
> errors on a few oss:
>
> infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
>
> I found a post suggesting this might be iommu related, disabling the iommu
> doesn't seem to help any.
>
> We're running luster 2.15, more or less at the tip of b2_15
> (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8)
>
> Has anyone seen this before or have any pointers?
>
> Thanks
>
> Shane
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
