Hey All,

I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time, as in... a *really* long time. Digging into it, I can see that the snap command itself is actually done, but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at that). Snapping 3500 nodes in chunks of 64 nodes that each take 10 minutes works out to about 55 chunks, so it looks like it'll take over 9 hours.

It seems the trouble is in the runCommand function in gpfs.snap. The function creates a child process to act as a sort of alarm that kills the specified command if it exceeds the timeout. The problem is that while the alarm process gets killed, the kill signal isn't passed on to the sleep process (because the sleep command runs as a child process of the "alarm" shell process), so the sleep keeps running after the alarm shell is gone.

In gpfs.snap changing this:
[[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1

to this:
[[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1

seems to fix the behavior.
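
For anyone who wants to see the effect outside of gpfs.snap, here is a minimal bash sketch of the same pattern. It is not the gpfs.snap code: list_descendants is a hypothetical stand-in for findDescendants (whose implementation I'm not reproducing here), run_with_alarm and time_cancel are made-up names, the sleep is 5 seconds rather than 600, and it assumes pgrep is installed. It also assumes the caller ends up waiting because the orphaned sleep still holds the output pipe open, which a command substitution makes easy to observe.

```shell
#!/bin/bash
# Hypothetical stand-in for gpfs.snap's findDescendants: print every
# descendant PID of $1, deepest first. Assumes pgrep is available.
list_descendants() {
  local child
  for child in $(pgrep -P "$1"); do
    list_descendants "$child"
    echo "$child"
  done
}

# Start a background "alarm" subshell (sleep, then act), then cancel it.
# $1 = shell-only    : kill just the alarm shell (the original behavior)
# $1 = with-children : kill the alarm shell and its descendants
run_with_alarm() {
  local mode=$1 alarm_pid
  ( sleep 5; echo "timeout reached" ) &
  alarm_pid=$!
  sleep 1   # give the subshell time to fork its sleep child
  if [[ $mode == with-children ]]; then
    # Word splitting of the PID list is intentional here.
    kill -9 $(list_descendants "$alarm_pid") "$alarm_pid"
  else
    kill -9 "$alarm_pid"
  fi
  wait "$alarm_pid" 2>/dev/null
}

# Print how many seconds the caller stays blocked. The command
# substitution returns only once every process holding its pipe has
# exited, including an orphaned sleep.
time_cancel() {
  local start=$SECONDS
  : "$(run_with_alarm "$1" 2>/dev/null)"
  echo $(( SECONDS - start ))
}

shell_only=$(time_cancel shell-only)
with_children=$(time_cancel with-children)
echo "alarm shell only:        blocked ${shell_only}s"
echo "alarm shell + children:  blocked ${with_children}s"
```

With only the alarm shell killed, the caller should stay blocked for roughly the full sleep time; with the descendants killed as well, it should return shortly after the kill.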

I'll open a PMR for this shortly but I'm just wondering if anyone else has seen this.

-Aaron


--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
