Oleg, Cameron, how can I look at the counts or the list of requests in the queue (ungranted locks) and the request wait time?
Can you please point me to the parameter names to check first for troubleshooting and monitoring? I'm looking at the parameters below, but I'm not sure about their meaning or entry format.

ldlm.lock_granted_count
ldlm.services.ldlm_canceld.req_history
ldlm.services.ldlm_canceld.stats
ldlm.services.ldlm_canceld.timeouts
ldlm.services.ldlm_cbd.req_history
ldlm.services.ldlm_cbd.stats
ldlm.services.ldlm_cbd.timeouts
mdt.*.exports.*.ldlm_stats
obdfilter.*.exports.*.ldlm_stats

Is there anything to look at under `ldlm.namespaces`?

Best regards,
Alex.

> On Jul 17, 2024, at 20:56, Oleg Drokin via lustre-discuss
> <[email protected]> wrote:
>
> On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss
> wrote:
> > In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
> > including references to ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb.
> > https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
> >
> > The apparent defaults back then in Lustre 2.8 for those two
> > parameters were 30MB and 20MB, respectively. On my 2.15 servers
> > with 256GB and no changes from us, I'm seeing numbers of 77244MB
> > and 51496MB, respectively. We recently got ourselves into a
> > situation where a subset of MDTs appeared to be entirely
> > overwhelmed trying to cancel locks, with ~500K locks in the
> > request queue but a request wait time of 6000 seconds. So, we're
> > looking at potentially limiting the locks on the servers.
> >
> > What's the formula for appropriately sizing ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
> > amounts have increased 20000X in 7 years)?
>
> What do you mean by the "locks in the request queue"?
> If you mean your server has got that many ungranted locks, there's
> nothing you can really do here - that's how many outstanding client
> requests you've got.
>
> Sure, you can turn clients away, but probably could be more productive
> to make sure your cancels are quicker?
>
> I think I've seen cases recently with servers gummed up with requests
> for creations being stuck waiting on OSTs to create more objects, while
> holding various dlm locks (= other threads that wanted to access these
> directories getting stuck too) while OSTs getting super slow because of
> an influx of (pretty expensive) destroy requests to delete objects from
> unlinked files.
> In the end dropping requests in flight from MDTs to OSTs helped much
> more by making sure OSTs were doing their creates faster so MDTs were
> blocking much less.
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
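
A side note on the sizing question quoted above: the two values Cameron reports can be compared against the node's memory with a few lines of arithmetic. The sketch below is only that, a rough check in Python. It assumes "256GB" means 256 GiB of installed RAM, and it tests the hypothesis that the defaults are 30% and 20% of usable memory rather than fixed megabyte counts. That hypothesis is an inference from the numbers, not something stated in this thread, and the helper names are made up for illustration.

```python
# Values reported in the thread for an unmodified Lustre 2.15 server.
LOCK_LIMIT_MB = 77244               # ldlm.lock_limit_mb
LOCK_RECLAIM_THRESHOLD_MB = 51496   # ldlm.lock_reclaim_threshold_mb

# Assumption: "256GB" in the thread means 256 GiB of installed RAM.
NOMINAL_RAM_MB = 256 * 1024


def share_of_ram(value_mb, ram_mb):
    """Fraction of ram_mb that value_mb represents."""
    return value_mb / ram_mb


def implied_ram_mb(value_mb, percent):
    """RAM size (in MB) for which value_mb would be exactly `percent` percent."""
    return value_mb * 100 / percent


limit_share = share_of_ram(LOCK_LIMIT_MB, NOMINAL_RAM_MB)
threshold_share = share_of_ram(LOCK_RECLAIM_THRESHOLD_MB, NOMINAL_RAM_MB)

# Hypothesis being tested: limit = 30% of RAM, reclaim threshold = 20% of RAM.
ram_from_limit = implied_ram_mb(LOCK_LIMIT_MB, 30)
ram_from_threshold = implied_ram_mb(LOCK_RECLAIM_THRESHOLD_MB, 20)

print(f"lock_limit_mb is {limit_share:.1%} of nominal RAM")
print(f"lock_reclaim_threshold_mb is {threshold_share:.1%} of nominal RAM")
print(f"RAM implied by a 30% limit:     {ram_from_limit:.0f} MB")
print(f"RAM implied by a 20% threshold: {ram_from_threshold:.0f} MB")
```

If both implied RAM sizes come out the same and slightly below the nominal 256 GiB, that would fit percentage-based defaults computed from the memory the kernel reports as usable, which would also explain why the values look so much larger than "30MB and 20MB".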
