I found the PBS cleanup on node12 hung on this pbs_rcp copy:
32316 ? S 0:00 /usr/local/sbin/pbs_mom
32317 ? S 0:00 /usr/local/sbin/pbs_rcp -r /var/spool/PBS/spool/4351.curie..OU hinkle curie04 /home/hinkle/DNA/AmberDNA/amberDNA_md11.o4351
kill -9 32317 did the trick, and the job was cleared from the queue.
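For anyone who hits the same thing, here is a rough sketch of how the stuck copy could be located on the execution node. The function name find_stuck_rcp is made up for illustration, and it assumes the copy shows up as a pbs_rcp process with the job's spool file in its arguments, as in the ps listing above.

```shell
# Hypothetical helper: print the PIDs of pbs_rcp processes whose arguments
# mention the spool file of a given job id.
# Reads "PID COMMAND ARGS..." lines on stdin, e.g. from: ps -e -o pid= -o args=
find_stuck_rcp() {
    jobid="$1"
    awk -v job="$jobid" '
        # Keep only pbs_rcp commands whose argument list contains "/<jobid>."
        $2 ~ /pbs_rcp$/ && index($0, "/" job ".") > 0 { print $1 }
    '
}

# Example use on the execution node, as root. Check the PIDs before killing:
#   ps -e -o pid= -o args= | find_stuck_rcp 4351
#   kill -9 <pid>
```

The helper only prints candidates; the kill is left as a manual step so nothing gets killed by a bad match.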
Thanks for the 2 cents!
Paul
Rouben Tchakhmakhtchian wrote:
I'm no expert on Torque/Maui (just learning myself), but here are my 2
cents:
In the worst-case scenario, there's always kill -9. You can always log
on to the node in question as root and blow away the offending process.
Sooner or later the MOM on that node should notice that the process has
died, and the job should then be purged from the queue, either by
pbs_server on curie or by maui.
If you want to be nice (and depending on how exactly the output file is
to be copied off to node4), you may want to try some good old-fashioned
DNS poisoning. Just use the /etc/hosts file on node12 to temporarily
make it think that node4 is some other node (perhaps even itself). If
the job retries the copy at regular intervals after a failure, the next
attempt to copy the output should then succeed.
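A minimal sketch of that /etc/hosts trick, in case it helps. The function names, the tag format, and the address 192.0.2.12 are placeholders of mine, not anything Torque provides. It assumes /etc/hosts is consulted before DNS on node12 and that the first matching entry wins, which is why the override is put at the top of the file.

```shell
# Hypothetical helpers for a temporary hosts override. The file path is a
# parameter so the change can be tried on a copy before touching /etc/hosts.
add_hosts_override() {
    hosts_file="$1"; ip="$2"; name="$3"
    # Write the tagged override first, then the existing entries, so the
    # override is the first match for that name.
    {
        printf '%s\t%s\t# TEMP-OVERRIDE:%s:\n' "$ip" "$name" "$name"
        cat "$hosts_file"
    } > "$hosts_file.tmp"
    cat "$hosts_file.tmp" > "$hosts_file"
    rm -f "$hosts_file.tmp"
}

remove_hosts_override() {
    hosts_file="$1"; name="$2"
    # Drop only the lines tagged by add_hosts_override for this name.
    grep -v -F "# TEMP-OVERRIDE:$name:" "$hosts_file" > "$hosts_file.tmp"
    cat "$hosts_file.tmp" > "$hosts_file"
    rm -f "$hosts_file.tmp"
}

# Example (as root on node12, with a placeholder address):
#   add_hosts_override /etc/hosts 192.0.2.12 node4
#   ... wait for the copy to be retried ...
#   remove_hosts_override /etc/hosts node4
```

Tagging the line makes it easy to take the override back out once node4 is repaired.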
If the output file is a non-issue, then, like I mentioned above, there's
always kill -9...
Cheers,
Rouben Tchakhmakhtchian
[EMAIL PROTECTED]
UTSC Computing & Networking Services
416-208-4732
Paul Van Allsburg wrote:
I have a job that was started from node4, and node4 has gone offline
with disk errors. The job ran on node12 and wants to write the final
output file via node4, but that node is unavailable, so the job sits
in the exiting (E) state. The cluster is running torque-1.2.0p2 and
maui-3.2.6p11.
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4351.curie       amberDNA_md11    hinkle           54:03:16 E long
qdel fails, and the -p option is not available in this release:
[EMAIL PROTECTED] ~]# qdel 4351
qdel: Request invalid for state of job 4351.curie.chem.hope.edu
[EMAIL PROTECTED] ~]# qdel -p 4351
qdel: invalid option -- p
usage: qdel [-W delay] job_identifier...
I also tried canceljob:
[EMAIL PROTECTED] ~]# canceljob 4351
ERROR: cannot cancel job '4351'
How can I force this job out of the queues?
Thanks!
Paul
_______________________________________________
mauiusers mailing list
http://www.supercluster.org/mailman/listinfo/mauiusers