I appreciate you taking the time to answer my inquiry. I did know about
the options that technically should not be set on a production server (e.g.
debug), although I have to admit I did not fully understand the
implications of this (e.g. "your Galaxy process may run out of memory if
it's serving large files."). I think what probably happened is that this
setting was turned on at some point to troubleshoot a problem and
was never disabled afterwards. I am going to start by disabling the
developer settings as you specified and then, depending on the result,
consider upgrading to a newer version of Galaxy. Again, I want to
express my gratitude for you taking the time to respond to my email. I'll
try to remember to post back to the mailing list to report my findings.
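For what it's worth, here is a minimal sketch of how I plan to check which
developer settings are still switched on. It assumes the settings live in an
[app:main] section of the Galaxy INI file; the function name and the sample
text are made up for illustration, and the list of settings is simply the one
named in the sample config comment quoted below.

```python
import configparser

# Settings named in the sample config comment; "debug" is the main switch.
DEV_SETTINGS = ("debug", "use_lint", "use_profile",
                "use_printdebug", "use_interactive")


def enabled_dev_settings(ini_text, section="app:main"):
    """Return the developer settings that are uncommented and truthy.

    The section name is an assumption; pass a different one if needed.
    """
    # interpolation=None so values containing "%" are read literally.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return []
    enabled = []
    for name in DEV_SETTINGS:
        if not parser.has_option(section, name):
            continue  # commented out or absent
        try:
            if parser.getboolean(section, name):
                enabled.append(name)
        except ValueError:
            # Not a recognizable boolean: flag it for manual review.
            enabled.append(name)
    return enabled


# Hypothetical config fragment, not copied from a real server.
SAMPLE = """
[app:main]
debug = True
# use_lint = True
use_interactive = False
"""

if __name__ == "__main__":
    print(enabled_dev_settings(SAMPLE))
```

Against the real file I would read its contents and pass them in instead of
SAMPLE; anything the function returns is a setting to comment out.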
On Mon, Jul 8, 2013 at 3:33 AM, Ross <ross.laza...@gmail.com> wrote:
> Hi Dan,
> That's old code. Updating will probably help.
> Logging level just takes disk space, but just in case you haven't followed
> debug = True uncommented used to fill all available server process RAM
> eventually AFAIK.
> If not already done, try
> # Debug enables access to various config options useful for development and
> # debugging: use_lint, use_profile, use_printdebug and use_interactive. It
> # also causes the files used by PBS/SGE (submission script, output, and error)
> # to remain on disk after the job is complete. Debug mode is disabled if
> # commented, but is uncommented by default in the sample config.
> # debug = True
> Hope this helps?
> On Mon, Jul 8, 2013 at 4:26 PM, Dan Sullivan <dansulli...@gmail.com> wrote:
>> Hi, Galaxy Developers,
>> I have what I'm hoping is a fairly simple inquiry for the Galaxy
>> community; basically, our production Galaxy server processes appear to be
>> dying off over time. Our production Galaxy instance uses Apache web
>> scaling features, so I have a number of server processes. For example, my
>> Apache configuration has:
>> BalancerMember http://127.0.0.1:8080
>> BalancerMember http://127.0.0.1:8081
>> BalancerMember http://127.0.0.1:8082
>> BalancerMember http://127.0.0.1:8083
>> BalancerMember http://127.0.0.1:8084
>> BalancerMember http://127.0.0.1:8085
>> Nothing unconventional as I understand it. Similarly, my galaxy config
>> has matching [server:ws3], [server:ws2] configuration blocks for each of
>> these processes. When I restart Galaxy, everything is all fine and good.
>> I'll see a server listening on each one of these ports (if I do something
>> like lsof -i TCP -P, for example). What appears to be happening is that,
>> for whatever reason, these server processes die off over time (i.e.
>> eventually nothing is listening on ports 8080-8085). This process can take
>> days, and once no servers are available, Apache begins
>> throwing 503 Service Unavailable errors. I am fairly confident this
>> process is gradual; for example, I just checked and Galaxy was still
>> available, but one server had died (the one on TCP port 8082). I do
>> have a single separate job manager and two job handlers; at this point I
>> believe this problem to be related to the servers only (i.e. the job
>> manager and job handlers do not appear to be crashing).
>> Now, I believe that late last week I might have 'caught' the last server
>> process dying, just by coincidence, although I am not 100% certain. Here
>> is the Traceback as it occurred:
>> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01)
>> PBS job state changed from Q to R
>> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01)
>> PBS job state changed from R to C
>> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01)
>> PBS job has completed successfully
>> galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code
>> or stdio handling; checking stderr for success
>> galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata
>> from file for: HistoryDatasetAssociation 6046
>> galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
>> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01)
>> PBS job state changed from R to E
>> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01)
>> PBS job state changed from E to C
>> galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01)
>> PBS job failed: Unknown error: -11
>> galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled
>> exception calling fail_job
>> Traceback (most recent call last):
>> File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
>> line 58, in run_next
>> File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
>> 560, in fail_job
>> if pbs_job_state.stop_job:
>> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>> Now, I have some questions regarding this issue:
>> 1) It appears to me that although this is a sub-optimal solution,
>> restarting Galaxy solves this problem (i.e. server processes will be
>> listening after restarting Galaxy). Is it possible, or safe, or sane to
>> just restart a single server on a single port? Ideally I would actually
>> like to fix the problem that is causing my server processes to crash,
>> although I figured it wouldn't hurt to ask this question regardless.
>> 2) Similar to the question above, is it possible to configure Galaxy in a
>> way that server processes re-spawn in a self-service manner (i.e. is this
>> a feature of Galaxy, for example, because server processes dying regularly
>> is either a known issue or expected and tolerable (but undesired) behavior)?
>> 3) To me, the error messages above aren't very meaningful, other than the
>> Traceback appears to be PBS-related. Would anybody be able to comment on the
>> problem above (i.e. have you seen something like this), or comment on
>> Galaxy server processes dying in general? I have done some brief searching
>> of the Galaxy mailing list for server crashes and did not find anything
>> suggesting this is a common problem.
>> 4) I am not 100% confident at this point that the Traceback above is what
>> killed the server process. Does anybody know of a specific string I can
>> search for (a literal) to identify when a server process actually dies? I
>> have done some basic review of log data (our Galaxy server generates lots
>> of logs), and Traceback does not appear to be a valid string to uniquely
>> identify a server crash (they occur too frequently). I currently have
>> logging configured at DEBUG.
>> In case this is relevant, I am using the following change set for Galaxy:
>> > hg parents
>> changeset: 9320:47ddf167c9f1
>> branch: stable
>> tag: tip
>> user: Nate Coraor <n...@bx.psu.edu>
>> date: Wed May 01 09:50:31 2013 -0400
>> summary: Use Galaxy's ErrorMiddleware since Paste's doesn't return
>> start_response. Fixes downloading tarballs from the Tool Shed when
>> use_debug = false.
>> I appreciate the time you took in reading my email, and any expertise you
>> could provide in helping me troubleshoot this issue.
>> Dan Sullivan
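
P.S. Until the root cause is found, a small check along these lines could
flag a dead server process before Apache starts returning 503s. This is only
a sketch: it assumes the six ports from the BalancerMember lines above, and
the function names and the timeout are my own choices, not anything Galaxy
provides. It only tests whether a TCP connection is accepted, so it cannot
tell a healthy process from a hung one.

```python
import socket

# Ports taken from the BalancerMember lines in the Apache configuration.
GALAXY_PORTS = range(8080, 8086)


def is_listening(host, port, timeout=1.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def dead_ports(host, ports, timeout=1.0):
    """Return the ports on which nothing accepted a connection."""
    return [p for p in ports if not is_listening(host, p, timeout)]


if __name__ == "__main__":
    missing = dead_ports("127.0.0.1", GALAXY_PORTS)
    if missing:
        print("No listener on port(s):", ", ".join(map(str, missing)))
    else:
        print("All server ports are accepting connections.")
```

Run from cron every few minutes, the timestamps of the first failure would
also narrow down which part of the log to read for question 4.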