On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss wrote:
> In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
> including references to ldlm.lock_limit_mb and
> ldlm.lock_reclaim_threshold_mb.
> https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
>
> The apparent defaults back then in Lustre 2.8 for those two parameters
> were 30MB and 20MB, respectively. On my 2.15 servers with 256GB and no
> changes from us, I'm seeing numbers of 77244MB and 51496MB,
> respectively. We recently got ourselves into a situation where a
> subset of MDTs appeared to be entirely overwhelmed trying to cancel
> locks, with ~500K locks in the request queue but a request wait time
> of 6000 seconds. So, we're looking at potentially limiting the locks
> on the servers.
>
> What's the formula for appropriately sizing ldlm.lock_limit_mb and
> ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
> amounts have increased 20000X in 7 years)?
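On the sizing question: I haven't checked the exact formula in the 2.15 source, but the numbers you quote work out to roughly 30% and 20% of total RAM, which suggests the defaults are now computed as a fraction of node memory rather than fixed megabyte values. A quick check with your 256GB figure (the small shortfall from exactly 30/20% would be explained by the kernel's MemTotal being a bit less than installed RAM):

    # 256 GB expressed in MB
    $ echo $((256 * 1024))
    262144
    # reported defaults as a fraction of total RAM
    $ echo "scale=3; 77244 / 262144" | bc
    .294
    $ echo "scale=3; 51496 / 262144" | bc
    .196

If you do want to cap the lock cache, both parameters should be settable on the servers with lctl in the usual way; the values below are purely illustrative - tune to your workload, keeping the reclaim threshold under the hard limit:

    $ lctl set_param ldlm.lock_reclaim_threshold_mb=20480
    $ lctl set_param ldlm.lock_limit_mb=30720
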
What do you mean by the "locks in the request queue"? If you mean your server has that many ungranted locks, there's nothing you can really do here - that's simply how many outstanding client requests you've got. Sure, you can turn clients away, but it would probably be more productive to make sure your cancels are quicker.

I've seen cases recently where servers were gummed up with creation requests stuck waiting on OSTs to create more objects while holding various DLM locks (so other threads that wanted to access those directories got stuck too), while the OSTs themselves were getting super slow because of an influx of (pretty expensive) destroy requests to delete objects from unlinked files. In the end, dropping the number of requests in flight from MDTs to OSTs helped much more, by making sure the OSTs were doing their creates faster so the MDTs were blocking much less.
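For reference, the MDT->OST connections are the OSP devices on the MDS, so that throttling lives under the osp namespace there. A sketch of what I mean (the cap of 8 is illustrative, not a recommendation):

    # on the MDS: current per-OSP RPC cap and queued object destroys
    $ lctl get_param osp.*.max_rpcs_in_flight osp.*.destroys_in_flight
    # lower the per-OSP cap on concurrent MDT->OST requests
    $ lctl set_param osp.*.max_rpcs_in_flight=8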
