I have followed the instructions on how to set up a local cluster closely (http://wiki.galaxyproject.org/Admin/Config/Performance/Cluster). Frankly, the "Galaxy Configuration" section was not clear to me: I am not sure whether the outlined steps should be applied to the server's universe_wsgi.ini or to the nodes'. So I may have overlooked some steps there, but here is a summary of what I am doing on the server:
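For concreteness, here is how I understood that section, i.e. that all of the cluster settings go in the server's universe_wsgi.ini rather than on the nodes. This is a sketch of my reading, not something the wiki states explicitly, and the three script paths are my assumption (scripts bundled under galaxy-dist/scripts/), not values taken from the wiki:

```ini
; Sketch: cluster settings in the server's universe_wsgi.ini ([app:main]).
; The three script paths below are my guesses (scripts shipped with
; galaxy-dist), not values confirmed by the wiki page.
start_job_runners = pbs,drmaa
drmaa_external_runjob_script = scripts/drmaa_external_runner.py
drmaa_external_killjob_script = scripts/drmaa_external_killer.py
external_chown_script = scripts/external_chown_script.py
pbs_application_server = galaxyhost
pbs_stage_path = /tmp/galaxy_stage/
pbs_dataset_server = galaxyhost
```

If those settings actually belong on the nodes instead, that alone could explain my problem, which is why I am asking.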
All my nodes have been configured so that the galaxy user can ssh/scp between nodes without a password. The galaxy user also has sudo rights on the server. Torque has been configured and tested between the nodes, so the cluster itself is working fine.

In the universe_wsgi.ini file on the server node:
--------------------------
start_job_runners = pbs,drmaa
drmaa_external_runjob_script = ~
drmaa_external_killjob_script = ~
external_chown_script = ~
pbs_application_server = galaxyhost (the server)
pbs_stage_path = /tmp/galaxy_stage/
pbs_dataset_server = galaxyhost (the server)  ## same as pbs_application_server
outputs_to_working_directory = False (if I change this to True, Galaxy will not start)
--------------------------
I have also created this symlink:
ln -s /nfsexport/galaxy_stage /usr/local/galaxy/galaxy-dist/database/tmp

After restarting Galaxy on the server node, the job appears to be submitted and its status is "R". When I "top" the processes on the node the job was sent to, I see two processes, ssh and scp, run by the galaxy server. This tells me something is being copied over to the node, but I am not sure what, or to where. After a while the job status changed to "W":

qstat
Job id          Name                User    Time Use  S  Queue
--------------  ------------------  ------  --------  -  -----
68.ngsgalaxy01  ...x...@idtdna.com  galaxy         0  W  batch

Here is what I see in the log when the job is sent.
>>>>>>>>>>>>>>>>>>
galaxy.jobs DEBUG 2013-03-15 15:34:54,183 (341) Working directory for job is: /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341
galaxy.jobs.handler DEBUG 2013-03-15 15:34:54,183 dispatching job 341 to pbs runner
galaxy.jobs.handler INFO 2013-03-15 15:34:54,231 (341) Job dispatched
galaxy.tools DEBUG 2013-03-15 15:34:54,309 Building dependency shell command for dependency 'samtools'
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) submitting file /usr/local/galaxy/galaxy-dist/database/pbs/341.sh
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) command is: PACKAGE_BASE=/usr/local/galaxy/software/samtools/0.1.16; export PACKAGE_BASE; . /usr/local/galaxy/software/samtools/0.1.16/env.sh; samtools flagstat "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_319.dat" > "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_384.dat"
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,394 (341) queued in default queue as 70.ngsgalaxy01.idtdna.com
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,966 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from N to R
>>>>>>>>>>>>>>>>>>

Here is the log when the ssh/scp on the node has finished:
>>>>>>>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:37:00,815 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from R to W
>>>>>>>>>>>>>>>>>>>>

Here is the log when I qdel that job:
>>>>>>>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,016 Exit code was invalid. Using 0.
galaxy.jobs DEBUG 2013-03-15 15:39:20,033 (341) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341 galaxy 10020
galaxy.jobs ERROR 2013-03-15 15:39:20,071 (341) Failed to change ownership of /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341, failing
Traceback (most recent call last):
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 336, in finish
    self.reclaim_ownership()
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 909, in reclaim_ownership
    self._change_ownership( self.galaxy_system_pwent[0], str( self.galaxy_system_pwent[3] ) )
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 895, in _change_ownership
    assert p.returncode == 0
AssertionError
galaxy.datatypes.metadata DEBUG 2013-03-15 15:39:20,160 Cleaning up external metadata files
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,172 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.o'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.e'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.ec'
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2 HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2/contents?ids=bbbfa414ae315caf HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
>>>>>>>>>>>>>>>>>>>>>>

Is there anything I am not doing, or doing wrong?

Regards,
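P.S. To see why the external_chown step above dies with a bare AssertionError (Galaxy's _change_ownership only checks p.returncode and discards stderr), I have been using a small harness of my own, not part of Galaxy, that runs a command the same way and prints its stderr:

```python
# Minimal debugging harness (my own sketch, not Galaxy code): run a command
# the way Galaxy's _change_ownership does, but return stderr so the actual
# sudo error is visible instead of a bare AssertionError.
import subprocess

def run_and_report(cmd):
    """Run cmd; return (returncode, stderr text) so failures are visible."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, err.decode()

# Example usage with the exact command from the log above (run from the
# galaxy-dist directory; paths are from my install):
# rc, err = run_and_report([
#     "/usr/bin/sudo", "-E", "scripts/external_chown_script.py",
#     "/usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341",
#     "galaxy", "10020"])
# print("exit status:", rc)
# print("stderr:", err)
```

A nonzero exit status here, together with the stderr text (e.g. a sudoers "not allowed" message or a password prompt failure), should show whether the sudo setup or the script itself is the problem.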
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/