Re: [lustre-discuss] 2.16.1 ptlrpcd infinite loop when machine runs out of RAM

Lewis Hyatt Wed, 05 Feb 2025 14:22:53 -0800

On Wed, Feb 5, 2025 at 10:21 AM Laura Hild <[email protected]> wrote:
>
> I wanna say 2.15 added those messages (the obd_memory ones, not the spinning 
> ptlrpcd) to every OoM. I remember seeing them when we first had 2.15 clients 
> and looking them up.  I take it you're not getting a corresponding OoM for 
> each, though?


Thanks. Yes, what we see is a single OoM instance, which is resolved
by the oom-killer and then triggers ptlrpcd to loop forever, spinning a
CPU and spamming the log messages. I guess the OoM callback it runs is
just being called over and over?

> It is typical for a host to struggle if OoM conditions are happening 
> regularly.  Is there workload manager where you could contain individual 
> jobs' memory usage, and limit the total to something with a bigger margin for 
> the system?

Right, we certainly don't expect it to happen often, and we can
arrange to make sure it does not happen. However, the fact that a single
OoM instance makes the server unusable and causes processes to hang
indefinitely is still an issue that would be great to resolve.

-Lewis
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
