I can't comment on the specific network issue, but in general it is far better to use the MOFED drivers than the in-kernel ones.
Cheers, Andreas

> On May 18, 2023, at 09:08, Nehring, Shane R [LAS] via lustre-discuss <[email protected]> wrote:
>
> Hello all,
>
> We recently added InfiniBand to our cluster and are in the process of testing
> it with Lustre. We're running the distro-provided drivers for the Mellanox
> cards with the latest firmware. Overnight we started seeing the following
> errors on a few OSSes:
>
> infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
>
> I found a post suggesting this might be IOMMU related, but disabling the
> IOMMU doesn't seem to help.
>
> We're running Lustre 2.15, more or less at the tip of b2_15
> (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8).
>
> Has anyone seen this before, or does anyone have any pointers?
>
> Thanks
>
> Shane
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
