Bernard,

I believe I have provided all the info that you requested.  Thanks for
your help.

>>Q: Was the job queued or was it running?
    A: Here is an example of a queued job.  When the cluster test runs,
the job goes from running to queued (Q).  If the job is deleted and
test_cluster is rerun, the job queues again, just as listed below.
                                                                        
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
12.redrock      oscartst workq    shelltest     --    2   1    --  10000 Q   --


>>Q: Contents of /home/oscartst/torque:
    A: -rwxr-xr-x  1 oscartst oscartst  364 Jun  7 09:32 pbs_script.shell
       -rwxr-xr-x  1 oscartst oscartst 1657 Jun  7 09:32 test_root
       -rwxr-xr-x  1 oscartst oscartst 1015 Jun  7 09:32 test_user
       No logs present.

>>Q: Torque related logs in /var/spool/pbs:  
    A: No Torque logs found in any of the directories.
        ./pbs:
              /aux  /checkpoint  /mom_logs  /mom_priv  pbs_environment  /sched_logs
              /sched_priv  /server_logs  server_name  /server_priv  /spool  /undelivered
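
For anyone repeating this check, here is a minimal Python sketch that lists whatever files exist in the three log directories named above. The function name find_log_files and the LOG_DIRS tuple are hypothetical, and the /var/spool/pbs default is simply the path used in this thread; it will probably need to be run as root to read the directories.

```python
import os

# Directory names taken from the /var/spool/pbs listing above.
LOG_DIRS = ("mom_logs", "sched_logs", "server_logs")


def find_log_files(pbs_home="/var/spool/pbs"):
    """Map each log directory name to a sorted list of the files in it.

    A directory that does not exist is mapped to None, so an empty
    directory ([]) can be told apart from a missing one.
    """
    found = {}
    for name in LOG_DIRS:
        path = os.path.join(pbs_home, name)
        if os.path.isdir(path):
            found[name] = sorted(
                entry for entry in os.listdir(path)
                if os.path.isfile(os.path.join(path, entry))
            )
        else:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, files in find_log_files().items():
        if files is None:
            print(name + ": directory not found")
        elif not files:
            print(name + ": empty")
        else:
            print(name + ": " + ", ".join(files))
```

An "empty" result for all three directories would match what is reported here: the directories exist but no log files have been written.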

>>Q: Output of ./test_cluster.
    A: Output below.
#-------------------------- ./test_cluster output -------------------------------------#
Performing root tests...
Maui service check:maui                                       PASSED
Torque node check                                               PASSED
Torque service check:pbs_server                           PASSED
/home mounts                    eqoscarnode1.eqoscardomain
/home mounts                    eqoscarnode2.eqoscardomain
/home mounts                    PASSED

Preparing user tests...
Performing user tests...
SSH ping test                                                   PASSED
SSH server->node                                                PASSED
SSH node->server                                                PASSED
Ganglia setup test                                              PASSED
Ganglia node count test                                         PASSED
Torque default queue definition                               PASSED
Checking for 2 free nodes:                                     FAILED
Not enough free nodes. Tests incomplete.
Checking for 2 free nodes:                                     FAILED
Not enough free nodes. Tests incomplete.
Checking for 2 free nodes:                                     FAILED
Not enough free nodes. Tests incomplete.
Checking for 2 free nodes:                                     FAILED
Not enough free nodes. Tests incomplete.

There were issues running some user test scripts.  Please check your
logs located in /home/oscartst.

Run APItests...

Running Installation tests for pvm
PASS       2006-06-26T08:22:15Z   pvmd-path-ls.apt
PASS       2006-06-26T08:22:15Z   envvar-pvm_arch.apt
PASS       2006-06-26T08:22:15Z   envvar-pvm_root.apt
PASS       2006-06-26T08:22:15Z   pvmd-path-which.apt
PASS       2006-06-26T08:22:15Z   modulecmd-path-ls.apt
PASS       2006-06-26T08:22:15Z   pvm-module-list.apt
PASS       2006-06-26T08:22:15Z   pvm-module-show-pvm_rsh.apt
PASS       2006-06-26T08:22:15Z   pvm-module-show-pvm_arch.apt
PASS       2006-06-26T08:22:15Z   pvm-module-show-pvm_root.apt

>>> "Bernard Li" <[EMAIL PROTECTED]>  >>>
Hi Tyler:

One of the TORQUE tests is a shelltest, and that is probably the job you
are seeing with qstat.  Was the job queued or was it running?

If you delete it and re-run the tests, does another shelltest get stuck?

Can you post the output of tests?  I'd like to see a complete list of
tests which passed/failed.

Also, did you check /home/oscartst/torque to see if there was anything
there?

You might also look for TORQUE related logs in /var/spool/pbs.

Cheers,

Bernard

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Tyler Cruickshank
Sent: Thu 22/06/2006 15:32
To: oscar-users@lists.sourceforge.net
Subject: [Oscar-users] No Free Nodes
 
Hi Bernard et al,
 
Thought I should re-send this message with the results of pbsnodes -a:
 
I successfully tested a 1 node cluster.  I then added a second node and
successfully completed Step 7, complete cluster setup.  I ran into
problems in Step 8, test cluster setup.  The test failed with a "Not
enough free nodes" error.  It checks for 2 free nodes, but can't find
them.  The message says to check error logs in /home/oscartst.  The
tests PASS up through the Torque default queue definition.  I also PASS
all of the PVM installation tests.
 
There are plenty of user list entries on the "not enough free nodes"
error and based on the pieces of advice found there I checked a few
things:
 
1) All directories are created in /home/oscartst, but no .out or .err
files are created in any of the dirs.  There aren't any logs to check.
 
2) http://localhost/ganglia shows that all nodes are up and it indicates
the correct number of nodes.  gexec is listed as OFF.
 
3) qstat -a indicates a single job open.  The job name is shelltest.  I
can delete the job using qdel, but this does not enable me to rerun the
test scripts with success.
 
4) pbsnodes -a indicates that the state of both nodes is free.

5) There was a suggestion to check /etc/hosts ... this looked good, with
the nodes and their IPs present in the file.
 
6) I rebooted the server node and retried the test.  Same problems.
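
The free-node check in item 4 can also be scripted. Below is a minimal Python sketch, assuming the usual pbsnodes -a layout of an unindented node name followed by indented "key = value" lines. The function name free_nodes and the sample text are hypothetical; only the node names come from the test output in this thread, and all attributes other than state are left out.

```python
def free_nodes(pbsnodes_output):
    """Return the names of nodes whose state is exactly 'free'."""
    free = []
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            current = None              # a blank line ends a node record
        elif not line[0].isspace():
            current = line.strip()      # an unindented line is a node name
        elif current is not None:
            key, _, value = line.strip().partition("=")
            if key.strip() == "state" and value.strip() == "free":
                free.append(current)
    return free


# Hypothetical sample text, not captured from the real cluster.
SAMPLE = """\
eqoscarnode1.eqoscardomain
     state = free

eqoscarnode2.eqoscardomain
     state = free
"""

if __name__ == "__main__":
    names = free_nodes(SAMPLE)
    print(str(len(names)) + " free node(s): " + ", ".join(names))
```

On a live server the same function could be given the text captured from pbsnodes -a. If it reports two free nodes while test_cluster still fails its own check, that would suggest the problem is in how the test counts nodes rather than in the node state itself.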
 
Thanks.
 
ty



_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
