I couldn't say exactly, but...

   - Your net is o2ib1. Is there an o2ib0?
   - Are you routing? If so, lnet routing or IB routing? Any issues with
   the routers or routing?
   - Verify the stability of lnet and the fabric path between the client and
   server in the quoted messages using a tool like lnet_selftest.
   - Verify the fabric: Check error counters on the switch and HCA ports
   involved. Use non-Lustre IB tools (ib_send_bw, etc.) to test the fabric.

Lustre can, and will, tell you when lnet issues arise, but it cannot tell you
anything about the actual network layer it is riding on, so it is usually a
good idea to certify the function of the network layer first before delving
into "what LBUG is ruining my weekend plans?"
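
If it helps with triage, here is a rough Python sketch (my own illustration,
not a Lustre tool) that tallies log lines like the ones quoted below by peer
NID and status code. It assumes Linux errno numbering (5 = EIO,
103 = ECONNABORTED, 125 = ECANCELED) and one log message per line. The names
tally, LINUX_ERRNO, STATUS_RE and NID_RE are hypothetical, and the regexes only
cover the message shapes in the quoted logs.

```python
import re
from collections import Counter

# Linux errno numbers for the statuses in the quoted logs (assumption: the
# nodes run Linux, so the negative status values are negated Linux errnos).
LINUX_ERRNO = {5: "EIO", 103: "ECONNABORTED", 125: "ECANCELED"}

# "status -103" on LustreError lines, or the trailing ": -125" on LNetError lines.
STATUS_RE = re.compile(r"(?:status |: )-(\d+)\b")
# An IPv4-style NID such as 172.19.171.15@o2ib1.
NID_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}@\w+)")


def tally(lines):
    """Count (peer NID or None, errno name) pairs across log lines."""
    counts = Counter()
    for line in lines:
        status = STATUS_RE.search(line)
        if not status:
            continue  # e.g. PUT_NACK lines carry no status code
        nid = NID_RE.search(line)
        code = int(status.group(1))
        name = LINUX_ERRNO.get(code, "errno %d" % code)
        counts[(nid.group(1) if nid else None, name)] += 1
    return counts


# Log lines taken from the quoted message, rejoined onto single lines.
SAMPLE = [
    "Oct  5 11:16:48 oss04 kernel: LNetError: "
    "6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) "
    "Error sending PUT to 12345-172.19.171.15@o2ib1: -125",
    "Oct  5 11:16:48 oss04 kernel: LustreError: "
    "6414:0:(events.c:450:server_bulk_callback()) "
    "event type 5, status -125, desc ffff8fe066bb9400",
    "Oct  4 14:59:48 oss04 kernel: LustreError: "
    "6383:0:(events.c:305:request_in_callback()) "
    "event type 2, status -103, service ost_io",
    "Oct  5 11:18:06 oss04 kernel: LustreError: "
    "6388:0:(events.c:305:request_in_callback()) "
    "event type 2, status -5, service ost_io",
    "Oct  5 11:18:06 oss04 kernel: LNet: "
    "6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) "
    "PUT_NACK from 172.19.171.15@o2ib1",
]

if __name__ == "__main__":
    for (nid, name), n in tally(SAMPLE).most_common():
        print(nid or "(no NID on line)", name, n)
```

Any peer that dominates the tally is a good first candidate for an
lnet_selftest run and a look at the port error counters.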

I hope that helps,

--Jeff

(resent to list in hopes of being beneficial to others)

On Thu, Oct 5, 2023 at 9:34 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> Lustre 2.12.2.
>
> We are seeing lots of errors on the servers such as:
> Oct  5 11:16:48 oss04 kernel: LNetError:
> 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending
> PUT to 12345-172.19.171.15@o2ib1: -125
> Oct  5 11:16:48 oss04 kernel: LustreError:
> 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125,
> desc ffff8fe066bb9400
>
> and
> Oct  4 14:59:48 oss04 kernel: LustreError:
> 6383:0:(events.c:305:request_in_callback()) event type 2, status -103,
> service ost_io
>
> and
> Oct  5 11:18:06 oss04 kernel: LustreError:
> 6388:0:(events.c:305:request_in_callback()) event type 2, status -5,
> service ost_io
> Oct  5 11:18:06 oss04 kernel: LNet:
> 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from
> 172.19.171.15@o2ib1
>
> and on the clients:
> m7: Oct  5 14:46:59 m7132 kernel: LustreError:
> 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
> desc ffff9a251fc14400
>
> and
> m7: Oct  5 11:18:34 m7086 kernel: LustreError:
> 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc
> ffff9a39ad668000
>
> Does anyone have any ideas about what could be causing this?
>
> Thanks,
> Alastair.
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
