I was surprised to read that Ctrl-C did not really kill restripe. It's supposed to! If it doesn't, that's a bug.
I ran this by my expert within IBM and he wrote to me:

First of all, a "PIT job" such as restripe, deldisk, delsnapshot, and such should be easy to stop by ^C-ing the management program that started it. The SG manager daemon holds open a socket to the client program for the purposes of sending command output, progress updates, error messages and the like. The PIT code checks this socket periodically and aborts the PIT process cleanly if the socket is closed. If this cleanup doesn't occur, it is a bug and should be worth reporting.

However, there's no exact guarantee on how quickly each thread on the SG mgr will notice, and then how quickly the helper nodes can be stopped, and so forth. The interval between socket checks depends, among other things, on how long it takes to process each file; if there are a few very large files, the delay can be significant. In the limiting case, where most of the FS storage is contained in a few files, this mechanism doesn't work [elided] well. So it can be quite involved and slow sometimes to wrap up a PIT operation.

The simplest way to determine if the command has really stopped is with mmdiag --commands issued on the SG manager node. This shows running commands with the command line, start time, socket, flags, etc. After ^C-ing the client program, the entry here should linger for a while, then go away. When the command exits you'll see an entry in the GPFS log file where it fails with err 50. If the ^C doesn't stop the command after a while, it is worth looking into.

If the command wasn't issued on the SG mgr node and you can't find where the client command is running, the socket is still a useful hint. While tedious, it should be possible to trace this socket back to the node where that command was originally run using netstat or equivalent. Poking around inside a GPFS internaldump will also provide clues; there should be an outstanding sgmMsgSGClientCmd command listed in the dump tscomm section.
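The "linger for a while, then go away" check described above could be scripted along these lines. This is only a rough sketch, not anything IBM ships: the function name wait_until_gone and its arguments are made up for illustration, and it assumes that the command name (for example mmrestripefs) appears literally in the mmdiag --commands output, which the description above suggests but does not guarantee.

```shell
#!/bin/sh
# Hypothetical helper: poll a "list running commands" program until no
# output line contains PATTERN.
# Usage: wait_until_gone PATTERN INTERVAL_SECONDS MAX_TRIES LISTER [ARGS...]
# Returns 0 once PATTERN is gone, 1 if it is still present after MAX_TRIES
# polls, and 2 if the lister command itself fails.
wait_until_gone() {
    pattern=$1
    interval=$2
    max_tries=$3
    shift 3
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # "$@" is the lister; on a real cluster this would be
        # mmdiag --commands run on the SG manager node (assumption:
        # its output contains the command name as plain text).
        out=$("$@") || return 2
        if ! printf '%s\n' "$out" | grep -qF -- "$pattern"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}
```

On the SG manager node this might be invoked as

    wait_until_gone mmrestripefs 30 20 /usr/lpp/mmfs/bin/mmdiag --commands

which would poll every 30 seconds for up to 20 tries. A return code of 1 would be the "worth looking into" case.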
Once you find it, just 'kill `pidof mmrestripefs`' or similar.

I'd like to warn the OP away from mmfsadm test pit. These commands are of course unsupported and unrecommended for any purpose (even internal test and development purposes, as far as I know). You are definitely working without a net there. When I was improving the integration between PIT and snapshot quiesce a few years ago, I looked into this and couldn't figure out how to (easily) make these stop and resume commands safe to use, so as far as I know they remain unsafe. The list command, however, is probably fairly okay; but it would probably be better to use mmfsadm saferdump pit.

From: Aaron Knister <[email protected]>
To: <[email protected]>
Date: 08/15/2016 10:49 PM
Subject: [gpfsug-discuss] mmfsadm test pit
Sent by: [email protected]

I just discovered this interesting gem poking at mmfsadm:

test pit fsname list|suspend|status|resume|stop [jobId]

There have been times where I've kicked off a restripe and either intentionally or accidentally ctrl-c'd it, only to realize that many times it's disappeared into the ether and is still running. The only way I've known so far to stop it is with a chmgr.

A far more painful instance happened when I ran a rebalance on an fs w/more than 31 nsds using more than 31 pit workers and hit *that* fun APAR which locked up access for a single filesystem to all 3.5k nodes. We spent 48 hours round the clock rebooting nodes as jobs drained to clear it up. I would have killed in that instance for a way to cancel the PIT job (the chmgr trick didn't work). It looks like you might actually be able to do this with mmfsadm, although how wise this is, I do not know (kinda curious about that).

Here's an example. I kicked off a restripe and then ctrl-c'd it on a client node.
Then ran these commands from the fs manager:

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00
debug: statusListP D40E2C70

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal stop 785979015170
debug: statusListP 0

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_STOPPING progress 4.01
debug: statusListP D4013E70

... some time passes ...

root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
debug: statusListP 0

Interesting.

-Aaron

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
