I found the pbs cleanup on node12 hung on:

32316 ?        S      0:00 /usr/local/sbin/pbs_mom
32317 ?        S      0:00 /usr/local/sbin/pbs_rcp -r /var/spool/PBS/spool/4351.curie..OU hinkle curie04 /home/hinkle/DNA/AmberDNA/amberDNA_md11.o4351

kill -9 32317 did the trick, and the job was cleared from the queue.
Thanks for the 2 cents!
Paul



Rouben Tchakhmakhtchian wrote:
I'm no expert on Torque/Maui (just learning myself), but here are my 2 cents:

In the worst-case scenario, there's always kill -9. You can log on to the node in question as root and blow away the offending process. Sooner or later the MOM on that node will realize that the process has died, which should eventually get the job purged from the queue, either by pbs_server on curie or by maui.
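
A rough sketch of how the offending process could be picked out before reaching for kill -9. The function name rcp_pids_for_job is made up for illustration and is not part of Torque; it only filters `ps -eo pid,args` style output for pbs_rcp lines that mention the job id, and leaves the actual kill to you.

```shell
# Hypothetical helper (not a Torque command): print the PIDs of pbs_rcp
# processes whose arguments mention the given job id.
# Reads "PID COMMAND ARGS..." lines on stdin, as printed by `ps -eo pid,args`.
rcp_pids_for_job() {
  awk -v job="$1" '
    # Field 2 is the command; match pbs_rcp with or without a leading path.
    $2 ~ /(^|\/)pbs_rcp$/ {
      # Look for the job id in the arguments only, not in the PID column.
      for (i = 3; i <= NF; i++) {
        if (index($i, job)) { print $1; break }
      }
    }'
}

# Possible usage on the execution node, as root:
#   ps -eo pid,args | rcp_pids_for_job 4351.curie
#   kill -9 <pid printed above>
```

Printing the PID first, rather than piping straight into kill, leaves a chance to check that it really is the stuck copy and not pbs_mom itself.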

If you want to be nice (and depending on how exactly the output file is to be copied off to node4), you may want to try some good old-fashioned DNS poisoning. Just use the /etc/hosts file on node12 to temporarily make it think that node4 is some other node (perhaps even itself). If the job in question retries writing out the file at regular intervals after a failure, the copy should succeed on the next attempt. If the output file is a non-issue, then, as I mentioned above, there's always kill -9...
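
A minimal sketch of the /etc/hosts trick, assuming the resolver on node12 consults the hosts file before DNS and uses the first matching entry. The function names, the marker comment, and the example address are all invented for illustration; the host name to override is whatever pbs_rcp is actually trying to reach. The temporary line is prepended so it wins over any existing entry, and it is tagged so it can be removed cleanly afterwards.

```shell
# Marker used to tag the temporary line (an arbitrary string chosen here).
HOSTS_MARK='# temp-override'

# Hypothetical helper: prepend "IP NAME" to a hosts file (default /etc/hosts).
add_hosts_override() {
  ip=$1; name=$2; file=${3:-/etc/hosts}
  tmp=$(mktemp) || return 1
  { printf '%s\t%s\t%s\n' "$ip" "$name" "$HOSTS_MARK"; cat "$file"; } > "$tmp"
  cat "$tmp" > "$file"   # rewrite in place to keep owner and permissions
  rm -f "$tmp"
}

# Hypothetical helper: drop every line carrying the marker again.
remove_hosts_override() {
  file=${1:-/etc/hosts}
  tmp=$(mktemp) || return 1
  # grep exits 1 when nothing is left to print; only >1 is a real error.
  grep -v -F "$HOSTS_MARK" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Possible usage on node12, as root (address and name are placeholders):
#   add_hosts_override 127.0.0.1 node4
#   ... wait for the copy to be retried ...
#   remove_hosts_override
```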

Cheers,

Rouben Tchakhmakhtchian
[EMAIL PROTECTED]
UTSC Computing & Networking Services
416-208-4732


Paul Van Allsburg wrote:

I have a job that was started from node4, and node4 has since gone offline with disk errors. The job ran on node12 and wants to write its final output file via node4, but that node is unavailable, so the job sits in the exiting state. The cluster is running torque-1.2.0p2 and maui-3.2.6p11.

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4351.curie       amberDNA_md11    hinkle           54:03:16 E long

qdel fails, and the -p option is not available in this release.

[EMAIL PROTECTED] ~]# qdel 4351
qdel: Request invalid for state of job 4351.curie.chem.hope.edu
[EMAIL PROTECTED] ~]# qdel -p 4351
qdel: invalid option -- p
usage: qdel [-W delay] job_identifier...

I tried canceljob ...

[EMAIL PROTECTED] ~]# canceljob 4351
ERROR:  cannot cancel job '4351'


How can I force this job out of the queues?

Thanks!
Paul

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
