We use NHC (Node Health Check) from LBNL here, and almost all of our Spectrum
Scale (SS) clients use NFS root.  We have a check that looks for access to a
couple of dotfiles (we have multiple SS file systems) and marks a node offline
if the checks fail.

Many things can contribute to the failure of a single client node, as we all
know.  Our checks are for actual node health on the clients, NOT to assess the
health of the file systems themselves.  I will normally see MANY other problems
from other monitoring sources long before I see stale file handles at the
client level.
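
As a rough illustration of the dotfile check described above (this is not
NHC's actual code, which is written in bash), a minimal Python sketch might
look like the following.  The function names check_dotfile and
node_is_healthy, and the marker paths in the comment, are hypothetical; the
only thing assumed about the system is that a stale handle surfaces as an
OSError with errno ESTALE.

```python
import errno
import os


def check_dotfile(path, stat=os.stat):
    """Classify one marker file as 'ok', 'stale', or 'error:<ERRNO>'.

    `stat` is injectable only so the check can be exercised without a
    real stale mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"
        # Any other failure (ENOENT, EACCES, EIO, ...) is reported by name.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def node_is_healthy(paths, stat=os.stat):
    """A node passes only if every marker file is reachable."""
    return all(check_dotfile(p, stat) == "ok" for p in paths)


# Hypothetical usage, one dotfile per file system:
#   node_is_healthy(["/fs1/.health_marker", "/fs2/.health_marker"])
```

A single stat() per file system is a small metadata operation, which is part
of why a check like this stays cheap even when many nodes run it.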

We did have to turn up the timeout for the file check to return on very busy
clients, but we haven't seen slowdowns due to hundreds of nodes all checking
the file at the same time.  Localized slowdowns will occasionally cause this
check to mark a node offline (normally a node that is extremely busy), but the
next check will put the node right back online in the batch system.
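
The timeout behaviour can be sketched along the same lines.  This is only an
illustration under stated assumptions, not how NHC implements it: the names
stat_with_timeout and node_state are hypothetical, and the timeout value is
left to the caller.  The stat() runs in a daemon thread so that a call stuck
on a hung mount cannot block the checker itself.

```python
import os
import threading


def stat_with_timeout(path, timeout_s, stat=os.stat):
    """Return 'ok', 'timeout', or 'error' for a stat() bounded by timeout_s."""
    result = {}

    def worker():
        try:
            stat(path)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    # Daemon thread: if stat() never returns, it will not keep the
    # interpreter alive at exit.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return "timeout"
    return result.get("status", "error")


def node_state(status):
    """Each check decides the state afresh, so a node marked offline by
    one slow check goes back online as soon as a later check passes."""
    return "online" if status == "ok" else "offline"
```

Raising timeout_s trades slower detection for fewer false offlines on busy
nodes, which is the adjustment described above.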

Ed Wahl
Ohio Supercomputer Center
[email protected]

________________________________
From: [email protected] 
<[email protected]> on behalf of Alexander John Mamach 
<[email protected]>
Sent: Friday, August 9, 2019 1:46 PM
To: [email protected] <[email protected]>
Subject: [gpfsug-discuss] Checking for Stale File Handles


Hi folks,



We’re currently investigating a way to check for stale file handles on the 
nodes across our cluster in a way that minimizes impact to the filesystem and 
performance.



Has anyone found a direct way of doing so? We considered a few methods,
including simply attempting to ls a GPFS filesystem from each node, but that
might produce false positives (detecting slowdowns as stale file handles) and
could negatively impact performance with hundreds of nodes doing this
simultaneously.



Thanks,

Alex

Senior Systems Administrator

Research Computing Infrastructure
Northwestern University Information Technology (NUIT)

2020 Ridge Ave
Evanston, IL 60208-4311

O: (847) 491-2219
M: (312) 887-1881
www.it.northwestern.edu


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
