On Wed, Feb 5, 2025 at 10:21 AM Laura Hild <[email protected]> wrote:
>
> I wanna say 2.15 added those messages (the obd_memory ones, not the spinning
> ptlrpcd) to every OoM. I remember seeing them when we first had 2.15 clients
> and looking them up. I take it you're not getting a corresponding OoM for
> each, though?

Thanks. Yes, what we see is a single OoM instance, which is resolved by the
oom-killer but then triggers ptlrpcd to loop forever, spinning a CPU and
spamming the log messages. I guess the OoM callback it runs is just being
called over and over?

> It is typical for a host to struggle if OoM conditions are happening
> regularly. Is there workload manager where you could contain individual
> jobs' memory usage, and limit the total to something with a bigger margin for
> the system?

Right, we certainly do not expect it to happen often, and we can arrange to
make sure it does not happen. However, the fact that a single OoM instance
makes the server unusable and causes processes to hang indefinitely is still
an issue that would be great to resolve.

-Lewis
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
