Hey All,
I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap
now takes a really long time, as in... a *really* long time. Digging into
it I can see that the snap command is actually done but the sshd child
is left waiting on a sleep process on the clients (a sleep 600 at that).
Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10
minutes looks like it'll take a good 10 hours.
It seems the trouble is in the runCommand function in gpfs.snap. The
function creates a child process to act as a sort of alarm to kill the
specified command if it exceeds the timeout. The problem is that while
the alarm process gets killed, the kill signal isn't passed on to the
sleep process (because the sleep command runs as a child process of the
"alarm" child shell process), so the sleep keeps running until it
finishes on its own.
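
The behavior can be reproduced outside of gpfs.snap. What follows is
only a minimal bash sketch of the pattern as I understand it, not the
actual runCommand code: the subshell, the 30-second sleep and the
variable names (alarm_pid, sleep_pid, orphan_alive) are all made up for
illustration.

```shell
#!/usr/bin/env bash
# Sketch only: imitates the "alarm" pattern, NOT the real gpfs.snap code.

# Hypothetical alarm child: a subshell that sleeps and would then act on
# the timeout. The sleep runs as a child process of this subshell.
( sleep 30; echo "timeout fired" ) &
alarm_pid=$!
sleep 1   # give the subshell a moment to fork its sleep

# Find the sleep by looking for a process whose parent is the alarm shell.
sleep_pid=$(ps -e -o pid= -o ppid= |
    awk -v p="$alarm_pid" '$2 == p { print $1 }')

# Kill only the alarm shell, as the original line in gpfs.snap does.
kill -9 "$alarm_pid"
wait "$alarm_pid" 2>/dev/null

# SIGKILL is not forwarded to children, so the sleep is still there.
if kill -0 "$sleep_pid" 2>/dev/null; then
    orphan_alive=yes
else
    orphan_alive=no
fi
echo "sleep still running after alarm was killed: $orphan_alive"

kill "$sleep_pid" 2>/dev/null   # clean up the orphan
```

On the clients the leftover sleep presumably still holds the ssh
session's output open, which would explain why sshd sits there waiting
for the full 600 seconds.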
In gpfs.snap changing this:
[[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1
to this:
[[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1
seems to fix the behavior.
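
For anyone who wants to try the idea outside of gpfs.snap, here is a
rough bash sketch of the same approach. list_descendants is a
hypothetical stand-in that I wrote for this example; I have not checked
how the real findDescendants is implemented. The subshell and variable
names are likewise assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: kill the alarm shell together with everything below it.

# Hypothetical helper (NOT the GPFS findDescendants): prints the PIDs of
# all descendants of the given PID, deepest first.
list_descendants() {
    local parent=$1 child
    for child in $(ps -e -o pid= -o ppid= |
            awk -v p="$parent" '$2 == p { print $1 }'); do
        list_descendants "$child"
        echo "$child"
    done
}

# Same made-up alarm child as before: a subshell with a sleep under it.
( sleep 30; echo "timeout fired" ) &
alarm_pid=$!
sleep 1   # give the subshell a moment to fork its sleep

# Collect the descendants first, then kill them along with the alarm
# shell itself, mirroring the modified line above.
victims="$(list_descendants "$alarm_pid") $alarm_pid"
kill -9 $victims > /dev/null 2>&1
wait "$alarm_pid" 2>/dev/null

echo "killed PIDs:" $victims
```

The descendants have to be looked up before the alarm shell is killed,
because once the parent is gone the sleep is re-parented and can no
longer be found through the alarm PID.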
I'll open a PMR for this shortly but I'm just wondering if anyone else
has seen this.
-Aaron
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776