Hello Simon,

Sadly, that "1036" is not a node ID, but just a counter.

These are tricky to troubleshoot. Usually, by the time you realize it's happening and try to collect some data, things have already timed out.

Since this mmdelsnapshot isn't something that's on a schedule from cron or the GUI and is a command you are running, you could try some heavy-handed data collection. You suspect a particular fileset already, so maybe have a 'mmdsh -N all lsof /path/to/fileset' ready to go in one window, and the 'mmdelsnapshot' ready to go in another window? When the mmdelsnapshot times out, you can find the nodes it was waiting on in the file system manager mmfs.log.latest and see what matches up with the open files identified by lsof.

It sounds like you already know this, but the <c0n42> type of internal node names in the log messages can be translated with 'mmfsadm dump tscomm' or also plain old 'mmdiag --network'.

Thanks,

Nate Falk
IBM Spectrum Scale Level 2 Support
Software Defined Infrastructure, IBM Systems

From: Simon Thompson <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 02/20/2020 03:14 PM
Subject: [EXTERNAL] Re: [gpfsug-discuss] Unkillable snapshots
Sent by: [email protected]

Hmm ... mmdiag --tokenmgr shows:

Server stats: requests 195417431 ServerSideRevokes 120140 nTokens 2146923 nranges 4124507 designated mnode appointed 55481 mnode thrashing detected 1036

So how do I convert "1036" to a node?

Simon

From: [email protected] <[email protected]> on behalf of Simon Thompson <[email protected]>
Sent: 20 February 2020 19:45:02
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Unkillable snapshots

Hi,

We have a snapshot which is stuck in the state "DeleteRequired". When deleting, it goes through the motions but eventually gives up with:

Unable to quiesce all nodes; some processes are busy or holding required resources.
mmdelsnapshot: Command failed. Examine previous error messages to determine cause.

And in the mmfs log on the FS manager there are a bunch of retries and "failure to quiesce" on nodes. However, in each retry it's never the same set of nodes. I suspect we have one HPC job somewhere killing us.

What's interesting is that we can delete other snapshots OK; it appears to be one particular fileset.

My old go-to "mmfsadm dump tscomm" isn't showing any particular node, and waiters around just tend to point to the FS manager node.

So ... any suggestions? I'm assuming it's some workload holding a lock open or some such, but tracking it down is proving elusive!

Generally the FS is also "lumpy" ... at times it feels like a wifi connection on a train using a terminal. I guess it's all related though.

Thanks

Simon

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
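For anyone trying the cross-check Nate describes (matching the nodes that failed to quiesce against the open files reported by lsof), here is a rough Python sketch of the matching step only. It assumes the captured 'mmdsh -N all lsof /path/to/fileset' output prefixes each line with the node name and a colon, and that you have already pulled the list of failed nodes out of mmfs.log.latest by hand, translating any <c0n42> style names first. The function names, node names and sample data are hypothetical and only there for illustration; nothing here parses the GPFS log itself.

```python
from collections import defaultdict


def open_files_by_node(mmdsh_output):
    """Group lsof output lines by the node that reported them.

    Assumption: each captured line looks like "<node>: <lsof line>".
    """
    by_node = defaultdict(list)
    for line in mmdsh_output.splitlines():
        node, sep, rest = line.partition(":")
        rest = rest.strip()
        # Skip lines without a node prefix, empty output, and lsof's header row.
        if not sep or not rest or rest.startswith("COMMAND"):
            continue
        by_node[node.strip()].append(rest)
    return dict(by_node)


def suspects(failed_nodes, mmdsh_output):
    """Return open files only for nodes that also failed to quiesce."""
    by_node = open_files_by_node(mmdsh_output)
    return {node: by_node[node] for node in failed_nodes if node in by_node}


# Hypothetical sample data, for illustration only.
SAMPLE_LSOF = """\
node01: COMMAND  PID USER  FD TYPE DEVICE SIZE/OFF NODE NAME
node01: python  4242 alice cwd DIR   0,45     4096 1234 /path/to/fileset/job1
node02: COMMAND  PID USER  FD TYPE DEVICE SIZE/OFF NODE NAME
node02: cat     5151 bob    3r REG   0,45      512 5678 /path/to/fileset/data.txt
"""

if __name__ == "__main__":
    # node02 and node03 stand in for nodes named in the quiesce failures.
    for node, files in suspects(["node02", "node03"], SAMPLE_LSOF).items():
        print(node)
        for entry in files:
            print("   ", entry)
```

Since the set of failing nodes changes on each retry, it may be worth running this once per retry and seeing which user or job keeps showing up across the results.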
