Hi, Galaxy Developers,

I have what I hope is a fairly simple question for the Galaxy community: our 
production Galaxy server processes appear to be dying off over time. Our 
production instance uses the Apache-based web scaling setup, so I run a number 
of server processes; for example, my Apache configuration has:

BalancerMember http://127.0.0.1:8080
BalancerMember http://127.0.0.1:8081
BalancerMember http://127.0.0.1:8082
BalancerMember http://127.0.0.1:8083
BalancerMember http://127.0.0.1:8084
BalancerMember http://127.0.0.1:8085

Nothing unconventional, as I understand it. Similarly, my Galaxy config has 
matching [server:ws3], [server:ws2], etc. configuration blocks, one for each of 
these processes. When I restart Galaxy, everything is fine: I see a server 
listening on each of these ports (checking with something like lsof -i TCP -P). 
What appears to be happening is that, for whatever reason, these server 
processes die off over time, until eventually nothing is listening on ports 
8080-8085. This can take days, and once no servers are left, Apache starts 
returning 503 Service Unavailable errors. I am fairly confident the decline is 
gradual; for example, I just checked and Galaxy was still available, but one 
server had died (the one on TCP port 8082). I do have a single separate job 
manager and two job handlers; at this point I believe the problem is limited to 
the web servers (i.e. the job manager and job handlers do not appear to be 
crashing).
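
In case it is useful, a quick Python probe along these lines gives me the same 
picture as lsof; this is just a rough sketch, and the port range is obviously 
specific to my setup:

# probe each balancer port with a TCP connect; a refused connection
# means that particular Galaxy server process is no longer listening
import socket

for port in range(8080, 8086):
    try:
        socket.create_connection(("127.0.0.1", port), timeout=2).close()
        print("port %d: listening" % port)
    except (socket.error, socket.timeout):
        print("port %d: NOT listening" % port)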

Now, I believe that late last week I happened, by coincidence, to 'catch' the 
last server process dying, although I am not 100% certain. Here is the 
traceback as it occurred:

galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01) PBS job state changed from Q to R
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01) PBS job state changed from R to C
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01) PBS job has completed successfully
galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata from file for: HistoryDatasetAssociation 6046
galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01) PBS job state changed from R to E
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job state changed from E to C
galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 58, in run_next
    method(arg)
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 560, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
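
For what it's worth, the immediate AttributeError looks like fail_job() 
assuming the job state object it receives carries a stop_job attribute, which 
the AsynchronousJobState in this case does not. I am not suggesting this is the 
real fix (presumably the wrong kind of job-state object reaching fail_job is 
the underlying bug), but as a purely local band-aid I was considering changing 
the test at line 560 of lib/galaxy/jobs/runners/pbs.py from

    if pbs_job_state.stop_job:

to something defensive along the lines of

    # hypothetical local workaround only: treat a missing stop_job
    # attribute as False instead of raising AttributeError
    if getattr(pbs_job_state, 'stop_job', False):

so that at least the exception is no longer unhandled. Please correct me if 
that would mask something important.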

Now, I have some questions regarding this issue:

1) Restarting Galaxy appears to work around the problem (i.e. server processes 
are listening again after a full restart), although it is obviously 
sub-optimal. Is it possible, safe, or sane to restart just a single server 
process on a single port? Ideally I would like to fix whatever is causing the 
server processes to crash, but I figured it wouldn't hurt to ask regardless.
2) Along the same lines, can Galaxy be configured so that server processes 
re-spawn on their own (i.e. is this a built-in feature, perhaps because server 
processes dying occasionally is a known issue or expected, tolerable-but-
undesired behavior)?
3) To me, the error messages above aren't very meaningful, other than that the 
traceback appears to be PBS-related. Would anybody be able to comment on the 
problem above (i.e. have you seen something like this?), or on Galaxy server 
processes dying in general? A brief search of the Galaxy mailing list for 
server crashes did not turn up anything suggesting this is a common problem.
4) I am not 100% confident that the traceback above is actually what killed the 
server process. Does anybody know of a specific literal string I can search for 
to identify when a server process actually dies? I have done some basic review 
of the log data (our Galaxy server generates a lot of logs), and "Traceback" is 
not specific enough to identify a server crash uniquely (tracebacks occur too 
frequently). I currently have logging configured at DEBUG. In the meantime I am 
considering running a small external watchdog to at least timestamp when a port 
stops answering; see the sketch after this list.
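
Building on the quick port probe above, this is the rough watchdog sketch I had 
in mind; the port list and one-minute interval are just my assumptions, and it 
does nothing but print a timestamp the first time a port stops accepting 
connections, so I can line that up against the Galaxy logs:

# stand-alone sketch: sweep the Galaxy web server ports once a minute and
# report the first sweep on which each port stops accepting connections
import socket
import time

PORTS = range(8080, 8086)   # the six web server ports behind Apache
INTERVAL = 60               # seconds between sweeps

def listening(port):
    try:
        socket.create_connection(("127.0.0.1", port), timeout=5).close()
        return True
    except (socket.error, socket.timeout):
        return False

alive = dict((p, True) for p in PORTS)
while True:
    for port in PORTS:
        up = listening(port)
        if alive[port] and not up:
            print("%s port %d stopped listening" % (time.strftime("%Y-%m-%d %H:%M:%S"), port))
        alive[port] = up
    time.sleep(INTERVAL)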

In case it is relevant, I am using the following Galaxy changeset:
> hg parents
changeset:   9320:47ddf167c9f1
branch:      stable
tag:         tip
user:        Nate Coraor <n...@bx.psu.edu>
date:        Wed May 01 09:50:31 2013 -0400
summary:     Use Galaxy's ErrorMiddleware since Paste's doesn't return 
start_response.  Fixes downloading tarballs from the Tool Shed when use_debug = 
false.
> 

I appreciate the time you took to read my email, and any expertise you can 
offer to help me troubleshoot this issue.

Dan Sullivan
