We use NHC (Node Health Check) from LBNL here, and almost all of our Spectrum Scale (SS) clients use NFS root. We have a check that looks for access to a couple of dotfiles (we have multiple SS file systems) and marks a node offline if the checks fail. As we all know, many things can contribute to the failure of a single client node. Our checks are for actual node health on the clients, NOT to assess the health of the file systems themselves. I normally see MANY other problems from other monitoring sources long before I see stale file handles at the client level.
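For anyone who wants to try the same idea outside NHC, here is a rough sketch of a dotfile check in Python. It is not NHC's own code (NHC has its own checks), and the names check_sentinel and node_is_healthy, the status strings, and the 10-second default are all made up for illustration. The stat runs in a child process so that a hung mount cannot block the checker itself, and the timeout is there because a check against a hung mount may never return.

```python
import errno
import subprocess
import sys

# Child program: stat the path, exit 0 on success or with the errno on failure.
_STAT_CHILD = """
import os, sys
try:
    os.stat(sys.argv[1])
except OSError as exc:
    sys.exit(exc.errno or 1)
"""


def check_sentinel(path, timeout_s=10.0):
    """Return 'ok', 'stale', 'timeout', or 'error' for one sentinel file.

    Hypothetical helper: the name, return values, and default timeout are
    assumptions for this sketch, not part of NHC or Spectrum Scale.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _STAT_CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may linger,
        # so we do not wait for it after sending the kill.
        proc.kill()
        return "timeout"
    if rc == 0:
        return "ok"
    if rc == errno.ESTALE:
        return "stale"
    return "error"


def node_is_healthy(paths, timeout_s=10.0):
    """True only if every sentinel file can be stat'ed within the timeout."""
    return all(check_sentinel(p, timeout_s) == "ok" for p in paths)
```

You would call node_is_healthy() with one dotfile per file system (for example a hypothetical "/fs1/.healthcheck") and let the batch system integration decide what to do with a False result. Keeping "timeout" separate from "stale" lets you tell a busy node apart from a real stale handle in the logs.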
We did have to raise the timeout for the file check to return on very busy clients, but we haven't seen slowdowns from hundreds of nodes all checking the file at the same time. Localized node slowdowns will occasionally cause this check to mark a node offline here and there (normally a node that is extremely busy), but the next check puts the node right back online in the batch system.

Ed Wahl
Ohio Supercomputer Center
[email protected]

________________________________
From: [email protected] <[email protected]> on behalf of Alexander John Mamach <[email protected]>
Sent: Friday, August 9, 2019 1:46 PM
To: [email protected] <[email protected]>
Subject: [gpfsug-discuss] Checking for Stale File Handles

Hi folks,

We’re currently investigating a way to check for stale file handles on the nodes across our cluster in a way that minimizes impact to the filesystem and performance. Has anyone found a direct way of doing so?

We considered a few methods, including simply attempting to ls a GPFS filesystem from each node, but that might have false positives (detecting slowdowns as stale file handles) and could negatively impact performance with hundreds of nodes doing this simultaneously.

Thanks,

Alex

Senior Systems Administrator
Research Computing Infrastructure
Northwestern University Information Technology (NUIT)
2020 Ridge Ave
Evanston, IL 60208-4311
O: (847) 491-2219
M: (312) 887-1881
www.it.northwestern.edu
