Lewis Hyatt <[email protected]> wrote:
> I am running a 2.16.1 client
> (g:718a3fd9c78507a589a657ec3745b4c553f66633) on Ubuntu 24.04.1 LTS
> (kernel 6.8.0-51). The client is using infiniband with in-kernel ofed
> drivers.
>
> I am seeing a mostly reproducible (although subject to race
> conditions) issue that occurs when a client server runs out of memory.
> I see messages like the following:
>
> 2025-02-04T18:02:01.300296-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174853
> 2025-02-04T18:02:01.311854-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174453
> 2025-02-04T18:02:01.417172-05:00 hostname kernel: message repeated 14 times: [ obd_memory max: 1282184920, obd_memory current: 11174453]
> 2025-02-04T18:02:01.417173-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174021
> 2025-02-04T18:02:01.417184-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11173805
>
> which are output to the syslog forever with ptlrpcd spinning 100% CPU,
> while lustre client processes which were waiting on I/O are hung.

I don't see a Jira ticket for this, so it would be best to file one.

This OOM callback comes from patch https://review.whamcloud.com/42121
("LU-13594 obdclass: Add OOM handler for obdclass"), which was added in
commit v2_14_57-26-g54d4cca6cb, so it would also have been in 2.15.0.

There is probably something wrong with the callback, or possibly with the
kernel. It isn't really clear what the semantics of this function are, and
there are few comments that describe its usage.

Cheers, Andreas
