[gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17)

Aaron Knister Sat, 10 Mar 2018 13:40:39 -0800

Hey All,

I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time, as in... a *really* long time. Digging into it I can see that the snap command is actually done but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at that). Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10 minutes looks like it'll take a good 10 hours.

It seems the trouble is in the runCommand function in gpfs.snap. The function creates a child process to act as a sort of alarm to kill the specified command if it exceeds the timeout. The problem is that while the alarm process gets killed, the kill signal isn't passed to the sleep process (because the sleep command is run as a process inside the "alarm" child shell process).


In gpfs.snap changing this:
[[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1

to this:

[[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1


seems to fix the behavior.
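
For anyone who wants to see the effect outside of gpfs.snap, here is a minimal standalone sketch of the same pattern. It is not the actual runCommand code: run_with_timeout and find_descendants are made-up names for illustration (find_descendants is my own stand-in, not the findDescendants helper that gpfs.snap ships), and it assumes bash plus a ps that accepts -e and -o.

```shell
#!/bin/bash
# Hypothetical stand-in for a descendant lookup: print every descendant
# PID of $1 by walking the pid/ppid table from ps. Not taken from gpfs.snap.
find_descendants() {
    local child
    for child in $(ps -e -o pid= -o ppid= | awk -v p="$1" '$2 == p { print $1 }'); do
        find_descendants "$child"
        echo "$child"
    done
}

# Illustrative only: run a command with an "alarm" subshell that kills it
# if it runs longer than the given number of seconds.
run_with_timeout() {
    local timeout=$1; shift
    "$@" &
    local cmd_pid=$!

    # The sleep runs as a child of this subshell, so killing only the
    # subshell would leave the sleep behind for the full timeout.
    ( sleep "$timeout"; kill -9 "$cmd_pid" ) >/dev/null 2>&1 &
    local alarm_pid=$!

    wait "$cmd_pid"
    local rc=$?

    # Kill the alarm's children (the sleep) along with the alarm itself.
    kill -9 $(find_descendants "$alarm_pid") "$alarm_pid" >/dev/null 2>&1
    wait "$alarm_pid" 2>/dev/null
    return $rc
}
```

With something like "run_with_timeout 600 hostname", dropping the $(find_descendants ...) part from the kill line should leave a "sleep 600" behind after the command returns, which is the kind of leftover I was seeing under sshd on the clients.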

I'll open a PMR for this shortly but I'm just wondering if anyone else has seen this.


-Aaron


--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
