<x-flowed>
pbs_sched is installed as part of PBS but is disabled, because the Maui scheduler is used instead. pbs_sched should remain stopped.

Restarting nodes while jobs are running can leave pbs_server with no pbs_mom to talk to about a job it still thinks is running. The brute-force method to kill jobs when all else fails (note that it removes every job file on the server, not only the stuck ones):
_______________________________
cexec -c "service pbs_mom stop" (stop the moms on the nodes)
service pbs_server stop
cd /usr/spool/PBS/server_priv/jobs
ls (you should see job files listed)
rm -f *
cexec -c "service pbs_mom start" (start the moms)
service pbs_server start
pbsnodes -a (check to ensure all nodes are "free")
_______________________________
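
If you would rather not type these by hand, the steps above could be wrapped in a small script. This is only a sketch: the names run, purge_pbs_jobs, DRY_RUN and JOBS_DIR are my own, the job directory is the one given above, and it assumes cexec and the service scripts behave on your cluster as they do in the manual steps. It defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Sketch of the brute-force cleanup. By default nothing is executed:
# each command is only printed. Set DRY_RUN=0 (as root on the server)
# to really run the steps.
DRY_RUN="${DRY_RUN:-1}"

# Job directory as given in the manual steps; adjust if your spool differs.
JOBS_DIR="${JOBS_DIR:-/usr/spool/PBS/server_priv/jobs}"

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

purge_pbs_jobs() {
    run cexec -c "service pbs_mom stop"    # stop the moms on the nodes
    run service pbs_server stop
    # Use the absolute path so a failed "cd" can never delete files elsewhere.
    run rm -f "$JOBS_DIR"/*
    run cexec -c "service pbs_mom start"   # start the moms
    run service pbs_server start
    run pbsnodes -a                        # all nodes should report "free"
}

purge_pbs_jobs
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 if they look right. The sketch does not stop when a step fails, so watch the output.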

Something to check:
The problem you're having may also be due to ssh: for PBS to work, a user should be able to ssh from a compute node to the server.
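
One way to test this is a quick check run as the ordinary user on a compute node. This is a rough sketch under my own assumptions: the function name check_ssh_to_server and the SSH variable are made up for illustration (SSH exists only so the ssh command can be swapped out), and it assumes "able to ssh" means logging in without a password prompt. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/bin/sh
# Command used to connect; overridable so the check can be tried with a stub.
SSH="${SSH:-ssh}"

# Usage: check_ssh_to_server <server-hostname>
check_ssh_to_server() {
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout=5 gives up after five seconds if the host is unreachable.
    if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$1" true; then
        echo "ok: passwordless ssh to $1 works"
    else
        echo "FAILED: cannot ssh to $1 without a prompt"
        return 1
    fi
}
```

After loading the function on a compute node, call it with the name of your server, for example check_ssh_to_server oscar-server (substitute whatever hostname the nodes actually use for the head node).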


Jeremy

At 12:01 PM 2/15/2002 -0500, Ron Yang wrote:
Dear OSCAR users,

Is there any known problem with "/usr/local/pbs/sbin/pbs_sched"?

The problem started like this:

I wrote the following PBS test job script to test an application that I use often.

[oscar@oscar-server ~]$ cat test.job
#PBS -l nodes=5:ppn=2
#PBS -j oe
~/bin/blastall -p tblastn -a 10 -d ~/test/21 -i ~/test/4932.FASTAC -o ~/test/test.out
[oscar@oscar-server ~]$ qsub test.job
12.oscar01
[oscar@oscar-server ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
12.oscar01       test.job         oscar            02:09:57 R workq
[oscar@oscar-server ~]$

It looks fine, but it was actually running only on the fifth node, which is the last node I have. It is supposed to run on all five nodes, using two CPUs on each node. Anyway, this is not really related to the subject.

Later, I accidentally ran "~/OSCAR_test/test_cluster" while the job above was running. After a few seconds, I realized that I needed to kill that job to run "test_cluster" properly, so I ran "qdel 12". However, when I ran "qstat", it showed the following. I kept checking for an hour and the output did not change.

[oscar@oscar-server ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
12.oscar01       test.job         oscar            02:09:57 E workq
13.oscar01       shelltest        oscar                   0 R workq
[oscar@oscar-server ~]$

So I restarted all the nodes. But when I ran "qstat" again after the reboot, the jobs were still there. When I ran "qdel", it gave me this:

[oscar@oscar-server ~]$ qdel 13
qdel: Server could not connect to MOM 13.oscar01
[oscar@oscar-server ~]$

So, I did this as root under "/etc/init.d":

[root@oscar-server init.d]$ ./pbs_server status
pbs_server (pid 628) is running...
[root@oscar-server init.d]$ ./pbs_sched status
pbs_sched is stopped
[root@oscar-server init.d]$ ./pbs_sched start
Starting PBS Scheduler: [FAILED]
[root@oscar-server init.d]$ cd /usr/local/pbs/sbin/
[root@oscar-server sbin]$ ./pbs_sched
pbs_sched: Address already in use (98) in main, bind
[root@oscar-server init.d]$

It looks like "pbs_sched" is in trouble. Because I don't understand PBS well, I am not sure whether this "pbs_sched" failure is related to the problem of deleting job 13. Can anybody help me out with this problem?

Many thanks.

Ron


H. Ronald Yang | DB Developer | www.labbook.com
"It is more important to know where you are going
than to get there quickly." --Mabel Newcomber
"If it's not fun, why do it?" --Jerry Greenfield


_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users


</x-flowed>
