I just discovered this interesting gem while poking at mmfsadm:

 test pit fsname list|suspend|status|resume|stop [jobId]

There have been times when I've kicked off a restripe and then, intentionally or accidentally, ctrl-c'd it, only to realize that the job often disappears into the ether and keeps running. The only way I've known so far to stop it is to move the fs manager with mmchmgr.

A far more painful instance happened when I ran a rebalance on an fs with more than 31 NSDs using more than 31 PIT workers and hit *that* fun APAR, which locked up access to a single filesystem for all 3.5k nodes. We spent 48 hours around the clock rebooting nodes as jobs drained to clear it up. In that instance I would have killed for a way to cancel the PIT job (the mmchmgr trick didn't work). It looks like you might actually be able to do this with mmfsadm, although I don't know how wise that is (I'm kinda curious about that).

Here's an example. I kicked off a restripe and then ctrl-c'd it on a client node. I then ran these commands on the fs manager:

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00
debug: statusListP D40E2C70

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal stop 785979015170
debug: statusListP 0

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_STOPPING progress 4.01
debug: statusListP D4013E70

... some time passes ...

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
debug: statusListP 0
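
If you wanted to script the list / stop / list sequence above instead of re-running it by hand, a rough sketch might look like the following. The helper names pit_jobs and pit_wait and the MMFSADM variable are made up for illustration. The parsing assumes the "JobId ... PitJobStatus ... progress ..." line format is exactly what the transcript shows, which is an assumption: these test subcommands are undocumented and the output may differ between releases.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the output format is
# assumed from the transcript above, not from any documentation.

# Read `mmfsadm test pit <fsname> list` output on stdin and print one
# "jobId status progress" line per PIT job; other lines (debug:) are ignored.
pit_jobs() {
    awk '$1 == "JobId" && $3 == "PitJobStatus" { print $2, $4, $6 }'
}

# Poll until `list` reports no PIT jobs for the given filesystem.
# $1 = filesystem name, $2 = poll interval in seconds (default 30).
pit_wait() {
    fs=$1
    interval=${2:-30}
    mmfsadm=${MMFSADM:-/usr/lpp/mmfs/bin/mmfsadm}
    while [ -n "$("$mmfsadm" test pit "$fs" list | pit_jobs)" ]; do
        sleep "$interval"
    done
}
```

After issuing the stop by hand as in the transcript, something like `pit_wait tlocal 60` on the fs manager would then block until the job drops off the list. It only watches; it never issues the stop itself, since I don't know how safe that is to automate.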

Interesting.

-Aaron

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
