Hi Nate,

So we're trying to clean up snapshots from the GUI ... we've found that if it 
fails to delete a snapshot one night for whatever reason, it doesn't go back 
and clean it up on a later day 😊


But yes, essentially running this by hand to clean up.


What I have found is that lsof hangs on some of the "suspect" nodes. But if I 
strace it, it's hanging on a process which is using a different fileset. For 
example, the fileset we can't delete is:


rds-projects-b which is mounted as /rds/projects/b


But on some suspect nodes, if I strace lsof /rds, it hangs at a process which 
has open files in:

/rds/projects/g which is a different fileset.
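
One way to see which nodes show this hang, without tying up a terminal on 
each, is to wrap the check in a time limit. This is only a sketch: the 
function name probe and the 30-second limit are made up for illustration, and 
it assumes GNU coreutils timeout is installed. A process stuck in an 
uninterruptible kernel wait may not die when signalled, so the probe itself 
could still stall on the worst nodes.

```shell
# Run a command with a time limit and report how it ended.
#   $1: limit in seconds; remaining arguments: the command to run
# Prints "ok" on success, "hung" if the limit was hit, otherwise "exit N".
probe() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "hung" ;;   # GNU timeout exits 124 when the limit expires
        *)   echo "exit $status" ;;
    esac
}

# Hypothetical usage on a suspect node:
#   probe 30 lsof /rds/projects/b
```

Note that a non-zero exit is not necessarily an error here, since lsof also 
exits non-zero when it finds nothing open under the path.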


What I'm wondering is whether it's these hanging processes in the "g" fileset 
that are killing us rather than something in the "b" fileset. Looking at the 
"g" processes, they look like a weather model and look to be dumping a lot of 
files in a shared directory, so I wonder if the mmfsd process is busy servicing 
that, and so whilst it's not got "b" locks, it's just too slow to respond?


Does that sound plausible?


Thanks


Simon

________________________________
From: [email protected] 
<[email protected]> on behalf of [email protected] 
<[email protected]>
Sent: 20 February 2020 21:26:39
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Unkillable snapshots

Hello Simon,

Sadly, that "1036" is not a node ID, but just a counter.

These are tricky to troubleshoot. Usually, by the time you realize it's 
happening and try to collect some data, things have already timed out.

Since this mmdelsnapshot isn't something that's on a schedule from cron or the 
GUI and is a command you are running, you could try some heavy-handed data 
collection.

You suspect a particular fileset already, so maybe have a 'mmdsh -N all lsof 
/path/to/fileset' ready to go in one window, and the 'mmdelsnapshot' ready to 
go in another window? When the mmdelsnapshot times out, you can find the nodes 
it was waiting on in the file system manager mmfs.log.latest and see what 
matches up with the open files identified by lsof.
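
The matching step could be sketched roughly as below. It assumes the node 
names from the log have already been translated to hostnames and copied by 
hand into a file, one per line, and that the saved lsof output has each line 
prefixed with "nodename:" (check what your mmdsh actually prints). The 
function name and the file names are hypothetical.

```shell
# Print node names that appear in both inputs.
#   $1: file with one node name per line (nodes the file system manager
#       logged as failing to quiesce, extracted by hand)
#   $2: saved lsof output, assumed to have "nodename:" at the start of
#       every line
nodes_in_both() {
    quiesce_nodes=$1
    lsof_output=$2
    # Take the text before the first colon on each line, de-duplicate,
    # then keep only the names that also appear in the quiesce list.
    cut -s -d: -f1 "$lsof_output" | sort -u | grep -Fx -f "$quiesce_nodes"
}

# Hypothetical usage:
#   mmdsh -N all lsof /path/to/fileset > lsof.out 2>&1
#   nodes_in_both quiesce_nodes.txt lsof.out
```

Any node it prints both held files open in the fileset and was reported as 
failing to quiesce, which makes it a reasonable place to look first.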

It sounds like you already know this, but the <c0n42> type of internal node 
names in the log messages can be translated with 'mmfsadm dump tscomm' or also 
plain old 'mmdiag --network'.

Thanks,

Nate Falk
IBM Spectrum Scale Level 2 Support
Software Defined Infrastructure, IBM Systems







From:        Simon Thompson <[email protected]>
To:        gpfsug main discussion list <[email protected]>
Date:        02/20/2020 03:14 PM
Subject:        [EXTERNAL] Re: [gpfsug-discuss] Unkillable snapshots
Sent by:        [email protected]
________________________________



Hmm ... mmdiag --tokenmgr shows:

    Server stats: requests 195417431 ServerSideRevokes 120140
           nTokens 2146923 nranges 4124507
           designated mnode appointed 55481 mnode thrashing detected 1036

So how do I convert "1036" to a node?

Simon

________________________________

From: [email protected] 
<[email protected]> on behalf of Simon Thompson 
<[email protected]>
Sent: 20 February 2020 19:45:02
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Unkillable snapshots


Hi,

We have a snapshot which is stuck in the state "DeleteRequired". When deleting, 
it goes through the motions but eventually gives up with:

Unable to quiesce all nodes; some processes are busy or holding required 
resources.
mmdelsnapshot: Command failed. Examine previous error messages to determine 
cause.

And in the mmfs log on the FS manager there are a bunch of retries and "failure 
to quiesce" on nodes. However, in each retry it's never the same set of nodes. 
I suspect we have one HPC job somewhere killing us.

What's interesting is that we can delete other snapshots OK, it appears to be 
one particular fileset.

My old go-to "mmfsadm dump tscomm" isn't showing any particular node, and 
waiters just tend to point to the FS manager node.

So ... any suggestions? I'm assuming it's some workload holding a lock open or 
some such, but tracking it down is proving elusive!

Generally the FS is also "lumpy" ... at times it feels like using a terminal 
over a wifi connection on a train. I guess it's all related though.

Thanks

Simon

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



