Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist with torque/pbs and the python-pbs binding.

While the galaxy process was under heavy load, the handler processes appear to have failed to contact the pbs_server, although the pbs_server was still up and running. After that, many statements like the following kept appearing in the handler.log file:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed

After restarting the galaxy process (run.sh), everything worked again, with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
 - contact the system admin
 - restart galaxy
 - automatically retry job submission after a while, so as not to crash workflows.
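
As a rough illustration of the last two items, here is a minimal sketch of a retry loop with increasing delays. It is not Galaxy's actual code: `connect_with_retry`, its parameters, and the `connect` callable (a hypothetical wrapper around whatever call opens the PBS server connection, returning None on failure) are all assumptions made for the example.

```python
import time


def connect_with_retry(connect, retries=5, initial_delay=1.0, backoff=2.0,
                       sleep=time.sleep, on_give_up=None):
    """Call connect() until it returns a handle or the retries run out.

    connect     -- hypothetical callable; returns a connection handle,
                   or None when the PBS server could not be reached
    on_give_up  -- optional hook (e.g. mail the admin) called with the
                   number of attempts once all of them have failed
    """
    delay = initial_delay
    for attempt in range(1, retries + 1):
        handle = connect()
        if handle is not None:
            return handle
        if attempt < retries:
            # Wait a bit longer after every failed attempt.
            sleep(delay)
            delay *= backoff
    if on_give_up is not None:
        on_give_up(retries)
    return None
```

With something like this, a state check or job submission would only be skipped after several spaced-out attempts, and the `on_give_up` hook could be used to notify the admin instead of silently logging the failure.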

Best regards,

Geert Vandeweyer


Geert Vandeweyer, Ph.D.
Department of Medical Genetics
University of Antwerp
Prins Boudewijnlaan 43
2650 Edegem
Tel: +32 (0)3 275 97 56
E-mail: geert.vandewe...@ua.ac.be
