Here you go. Note that the script moves everything into a directory called jobs/killed, which needs to exist before you run it. Also, it's set up for PBSPro and I've got my directories hard-coded in there, so you might need to tweak it slightly. It should be trivial to change the paths and make it easier to switch to OpenPBS.


#!/usr/bin/perl
use strict;
use warnings;

# Removes jobs from the PBS queue that are stuck in "E" (exiting) state

# Grabs all of the E jobs from qstat
# Shuts down PBS while keeping jobs running
# Loops through all of the E jobs
# Moves the files associated with the E jobs to jobs/killed
# Restarts PBS


# GRAB THE JOB NUMBERS OF THE E JOBS FROM QSTAT

# get a list of all of the jobs and check to see if
# any of them are in "E" state; the header lines are
# skipped because they do not start with a job number
my @JOBIDLIST;
my @JOBS = `/usr/pbs/bin/qstat`;
foreach (@JOBS) {
        my @CURRJOB = split;
        next unless @CURRJOB >= 5 && $CURRJOB[4] eq 'E';
        # keep only the leading job number, e.g. 1234 from 1234.server
        next unless $CURRJOB[0] =~ /^(\d+)/;
        push(@JOBIDLIST, $1);
}

# nothing stuck, so leave the server alone
unless (@JOBIDLIST) {
        print "No jobs in E state\n";
        exit 0;
}
print "E jobs: @JOBIDLIST\n";

# SHUT DOWN PBS
system("/usr/pbs/bin/qterm -t quick") == 0
        or die "qterm failed: $?\n";

# MOVE THE "E" JOBS TO THE KILLED DIRECTORY
my $PBS_SPOOL = "/var/spool/PBS/server_priv/jobs";
foreach (@JOBIDLIST) {
        system("mv $PBS_SPOOL/$_.* $PBS_SPOOL/killed/.") == 0
                or warn "could not move files for job $_\n";
}

# START UP PBS
system("/usr/pbs/sbin/pbs_server") == 0
        or die "pbs_server failed to start: $?\n";



On Nov 6, 2003, at 2:34 PM, Bernard Li wrote:

Hey Jenn:

Thanks for your response - it'd be great if you could email me that script so I can give it a shot.

I am just curious as to why jobs would get stuck in the exit state and not actually 'exit'... seems kind of odd.

We tried to use Nagios a while back, and while it was quite comprehensive, we noticed that at times the information it presented wasn't accurate, so we stopped using it.

Right now we are using ganglia/clumon to monitor our cluster usage and I think they work great. And I would agree that having a web-based application to do 'self-healing' (I guess that is the term) would be great.

I also think it would be beneficial to have a way to submit and kill jobs via the web - I wouldn't be surprised if someone has already written software like that. If not, I would surely try to write one in the future.

Cheers,

Bernard

Jenn Sturm wrote:

I wrote a script that finds jobs in "E" state (grabs the job numbers), runs qterm -t quick to shut down the pbs_server but leave jobs running, goes to $PBSHOME/server_priv/jobs and removes the job files, and then restarts the pbs_server.

The script is clunky (I'm not much of a script writer) but it works. Lemme know if you'd like a copy. Typically we run it by hand whenever we catch E jobs. Ideally, I'd like to have it run from our monitoring software (Nagios).

-Jenn Sturm
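For the Nagios idea, the detection half of the script could be wrapped as a check. What follows is only a rough sketch, not something taken from a running setup: it counts E jobs in qstat-style text and maps the count to the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The sample output, the host name and the threshold of 5 are made up for illustration, and a single sample cannot tell a stuck job from one that is exiting normally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count jobs whose state column (fifth field) is "E" in qstat-style output.
# Header lines are ignored because they do not start with a job number.
sub count_e_jobs {
    my ($text) = @_;
    my $count = 0;
    foreach my $line (split /\n/, $text) {
        my @f = split ' ', $line;
        $count++ if @f >= 5 && $f[0] =~ /^\d+/ && $f[4] eq 'E';
    }
    return $count;
}

# Map the count to a Nagios plugin exit code and status line.
# $crit is a hypothetical threshold, not a recommended value.
sub nagios_status {
    my ($count, $crit) = @_;
    return (0, "OK - no jobs in E state") if $count == 0;
    return (2, "CRITICAL - jobs in E state: $count") if $count >= $crit;
    return (1, "WARNING - jobs in E state: $count");
}

# Made-up sample in the default qstat layout (hypothetical jobs and host).
my $sample = <<'END';
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
101.head         sim_a            alice            00:10:02 R workq
103.head         sim_c            alice            00:00:41 E workq
END

# On a live system the text would come from: join('', `/usr/pbs/bin/qstat`)
my ($code, $msg) = nagios_status(count_e_jobs($sample), 5);
print "$msg (exit code $code)\n";
# A real plugin would finish with: exit $code;
```

Whether the check should also trigger the qterm/restart step automatically is a separate decision; this sketch only reports.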

On Nov 6, 2003, at 1:52 PM, Bernard Li wrote:

If you need to remove every single job from PBS, how would you typically do it?

I guess one way is to use xpbs, select all, and then delete, but that GUI takes forever to load. Is there a command-line way of doing it?

qdel takes multiple job IDs as arguments, so I suppose I could run qstat and pipe the first 'column' to qdel; that would probably work.
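A minimal sketch of that idea in Perl, assuming the default qstat layout where the job ID is the first column and the header lines do not start with a digit. The sample output, job names and host are made up, and the command is only printed here rather than executed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the job IDs found in the first column of qstat-style output.
# Header and separator lines are skipped because they do not start
# with a digit.
sub job_ids_from_qstat {
    my ($text) = @_;
    my @ids;
    foreach my $line (split /\n/, $text) {
        my ($first) = split ' ', $line;
        push @ids, $first if defined $first && $first =~ /^\d+/;
    }
    return @ids;
}

# Made-up sample in the default qstat layout (hypothetical jobs and host).
my $sample = <<'END';
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
101.head         sim_a            alice            00:10:02 R workq
102.head         sim_b            bob              0        Q workq
103.head         sim_c            alice            00:00:41 E workq
END

# On a live system the text would come from: join('', `qstat`)
my @ids = job_ids_from_qstat($sample);

# Print the command instead of running it; to act on it, replace the
# print with: system('qdel', @ids)
print "qdel @ids\n" if @ids;
```

Printing first makes it easy to check the list before anything is deleted.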

However, last night we were having issues with jobs in the E (exiting) state not exiting, and none of the above-mentioned methods could kill them.

So I ended up going into /var/spool/pbs/server_priv and deleting the jobs dir :)

Cheers,

Bernard



-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive?  Does it
help you create better code?   SHARE THE LOVE, and help us help
YOU!  Click Here: http://sourceforge.net/donate/
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users

+-------------------------------------------------------------------+
Jennifer Sturm
System Administrator and Research Support Specialist
Chemistry Department
Hamilton College
[EMAIL PROTECTED]
[EMAIL PROTECTED]
315-859-4745
http://www.chem.hamilton.edu/
http://mars.chem.hamilton.edu/
+-------------------------------------------------------------------+



