I'm no expert on Torque/Maui (just learning myself), but here are my 2 cents:

In the worst case, there's always kill -9. You can log on to the node in question as root and kill the offending process. Sooner or later the MOM (pbs_mom) on that node should notice that the process has died, and the job should eventually be purged from the queue, either by pbs_server on curie or by maui.
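For what it's worth, here is a rough sketch of that step, to be run as root on the node where the job is executing (node12 in your case). The function names list_user_procs and force_kill are just made up for illustration, and trying SIGTERM before falling back to kill -9 is my own addition, not something Torque requires:

```shell
#!/bin/sh
# Hypothetical helpers; run as root on the compute node.

# list_user_procs USER: print the user's processes (PID, parent PID,
# elapsed time, command line) so the right PID can be picked by hand.
list_user_procs() {
    ps -u "$1" -o pid=,ppid=,etime=,args=
}

# force_kill PID [GRACE]: send SIGTERM, wait GRACE seconds (default 5),
# then send SIGKILL (kill -9) if the process still exists.
force_kill() {
    pid="$1"
    grace="${2:-5}"
    # If the signal cannot be delivered, the process is already gone
    # (or cannot be signalled by this user); nothing more to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    sleep "$grace"
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}
```

Something like "list_user_procs hinkle" followed by "force_kill <pid>" for the job's process would be the idea. Check the command line in the listing carefully before killing anything, since the same user may have other jobs on that node.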

If you want to be nice (and depending on how exactly the output file is copied off to node4), you may want to try some good old-fashioned DNS poisoning, done locally: use the /etc/hosts file on node12 to temporarily make it think that node4 is some other node (perhaps even node12 itself). If the job retries writing the file at regular intervals after a failure, the next attempt to copy the output should succeed. If the output file is a non-issue, then, as I mentioned above, there's always kill -9...
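A minimal sketch of the /etc/hosts trick, in case it helps. It assumes node12 consults /etc/hosts before DNS (check the hosts line in /etc/nsswitch.conf), and that the copy addresses node4 by a name you can list in the file; I don't know which name your setup actually uses. The marker comment and the function names are hypothetical. The entry is prepended because the first matching line in the hosts file is the one that takes effect:

```shell
#!/bin/sh
# Hypothetical helpers; run as root on node12.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
OVERRIDE_TAG="# TEMP-OVERRIDE"   # marker used to find the line again

# add_override IP NAME: put a tagged "IP NAME" line at the top of the file.
add_override() {
    tmp=$(mktemp) || return 1
    { printf '%s\t%s\t%s\n' "$1" "$2" "$OVERRIDE_TAG"; cat "$HOSTS_FILE"; } > "$tmp"
    # Write back through the original file to keep its owner and mode.
    cat "$tmp" > "$HOSTS_FILE"
    rm -f "$tmp"
}

# remove_override: drop every line carrying the marker.
remove_override() {
    tmp=$(mktemp) || return 1
    # grep exits 1 when it prints nothing; only exit codes above 1 are errors.
    grep -v -F -- "$OVERRIDE_TAG" "$HOSTS_FILE" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$HOSTS_FILE"
    rm -f "$tmp"
}
```

For example, "add_override 127.0.0.1 node4" would point node4 at node12 itself; once the job has left the queue, "remove_override" puts the file back the way it was. Keep a copy of /etc/hosts before touching it.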

Cheers,

Rouben Tchakhmakhtchian
[EMAIL PROTECTED]
UTSC Computing & Networking Services
416-208-4732


Paul Van Allsburg wrote:
I have a job that was started from node4, and node4 has gone offline with disk errors. The job ran on node12 and wants to write the final output file via node4, but that node is unavailable and the job sits in the exiting state. The cluster is running torque-1.2.0p2 and maui-3.2.6p11.

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4351.curie       amberDNA_md11    hinkle           54:03:16 E long

qdel fails, and the -p option is not available in this release.

[EMAIL PROTECTED] ~]# qdel 4351
qdel: Request invalid for state of job 4351.curie.chem.hope.edu
[EMAIL PROTECTED] ~]# qdel -p 4351
qdel: invalid option -- p
usage: qdel [-W delay] job_identifier...

I tried canceljob ...

[EMAIL PROTECTED] ~]# canceljob 4351
ERROR:  cannot cancel job '4351'


How can I force this job out of the queues?

Thanks!
Paul



_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
