Hi Dan,
That's old code. Updating will probably help.
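
For what it's worth, the traceback you caught points at a bug in the old
PBS runner's error path: fail_job reads pbs_job_state.stop_job, but the
AsynchronousJobState it was handed never had that attribute set, so the
failure handler itself blows up.  Just to illustrate the kind of guard
that avoids this (hypothetical, not the actual upstream patch):

    # Hypothetical guard, not the real fix in galaxy-dist: default a missing
    # stop_job flag to False instead of letting fail_job raise AttributeError.
    if getattr(pbs_job_state, 'stop_job', False):
        pass  # ...go on to stop the PBS job as fail_job normally would
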
The DEBUG logging level itself just takes disk space, but in case you
haven't already been through the production server guide:
http://wiki.galaxyproject.org/Admin/Config/Performance/ProductionServer?action=show&redirect=Admin%2FConfig%2FPerformance
leaving debug = True uncommented used to eventually consume all available
RAM in each server process, AFAIK.
If not already done, try
# Debug enables access to various config options useful for development and
# debugging: use_lint, use_profile, use_printdebug and use_interactive.  It
# also causes the files used by PBS/SGE (submission script, output, and error)
# to remain on disk after the job is complete.  Debug mode is disabled if
# commented, but is uncommented by default in the sample config.
# debug = True
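
On your question 1: each [server:...] block is its own process, so you
can stop and start just one of them without touching the rest.  A rough
sketch of what that looks like with the stock run.sh/paster layout (the
server name and pid/log file names below are guesses; adjust them to
match your install):

    cd /group/galaxy/galaxy-dist
    # Stop just the dead server (ws2 here is only an example name); if the
    # process is already gone this may just complain about a stale pid file.
    python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 \
        --pid-file=ws2.pid --log-file=ws2.log --stop-daemon
    # Start it again as a daemon.
    python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 \
        --pid-file=ws2.pid --log-file=ws2.log --daemon

Apache's balancer should start sending requests to that port again as
soon as something is listening on it.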

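On question 2: not that I know of; Galaxy itself won't re-spawn a server
process that has died.  The usual approach is to run each server under a
process supervisor such as supervisord.  An untested sketch (the program
name and paths are made up; you'd want one [program] block per server):

    [program:galaxy_ws0]
    directory = /group/galaxy/galaxy-dist
    command = python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws0
    autostart = true
    autorestart = true

For question 4, I've also put a quick port-polling sketch inline below,
right after your numbered list.
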
Hope this helps?


On Mon, Jul 8, 2013 at 4:26 PM, Dan Sullivan <dansulli...@gmail.com> wrote:

> Hi, Galaxy Developers,
>
> I have what I'm hoping is a fairly simple inquiry for the Galaxy
> community; basically, our production Galaxy server processes appear to be
> dying off over time.  Our production Galaxy instance uses Apache's web
> scaling features, so I have a number of server processes; for example, my
> Apache configuration has:
>
> BalancerMember http://127.0.0.1:8080
> BalancerMember http://127.0.0.1:8081
> BalancerMember http://127.0.0.1:8082
> BalancerMember http://127.0.0.1:8083
> BalancerMember http://127.0.0.1:8084
> BalancerMember http://127.0.0.1:8085
>
> Nothing unconventional, as I understand it.  Similarly, my Galaxy config
> has matching [server:ws3], [server:ws2], etc. configuration blocks for each
> of these processes.  When I restart Galaxy, everything is fine: I'll see a
> server listening on each of these ports (checking with lsof -i TCP -P, for
> example).  What appears to be happening is that, for whatever reason, these
> server processes die off over time (i.e. eventually nothing is listening on
> ports 8080-8085).  This process can take days, and once no servers are
> available, Apache begins throwing 503 Service Unavailable errors.  I am
> fairly confident the die-off is gradual; for example, I just checked and
> Galaxy was still available, but one server had died (the one on TCP port
> 8082).  I do have a single separate job manager and two job handlers; at
> this point I believe this problem affects the web servers only (i.e. the
> job manager and job handlers do not appear to be crashing).
>
> Now, I believe that late last week I might have 'caught' the last server
> process dying, just by coincidence, although I am not 100% certain.  Here
> is the Traceback as it occurred:
>
> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01)
> PBS job state changed from Q to R
> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01)
> PBS job state changed from R to C
> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01)
> PBS job has completed successfully
> galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code
> or stdio handling; checking stderr for success
> galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata
> from file for: HistoryDatasetAssociation 6046
> galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01)
> PBS job state changed from R to E
> galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01)
> PBS job state changed from E to C
> galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01)
> PBS job failed: Unknown error: -11
> galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled
> exception calling fail_job
> Traceback (most recent call last):
>   File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 58, in run_next
>     method(arg)
>   File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
> 560, in fail_job
>     if pbs_job_state.stop_job:
> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>
> Now, I have some questions regarding this issue:
>
> 1) It appears that, although this is a sub-optimal solution, restarting
> Galaxy solves the problem (i.e. server processes will be listening again
> after a restart).  Is it possible, or safe, or sane to just restart a
> single server on a single port?  Ideally I would like to fix whatever is
> causing my server processes to crash, but I figured it wouldn't hurt to
> ask regardless.
> 2) Similarly, is it possible to configure Galaxy so that server processes
> re-spawn on their own (i.e. is this a feature of Galaxy, for example
> because server processes dying regularly is either a known issue or
> expected and tolerable (but undesired) behavior)?
> 3) To me, the error messages above aren't very meaningful, other than that
> the traceback appears to be PBS-related.  Would anybody be able to comment
> on the problem above (i.e. have you seen something like this), or on
> Galaxy server processes dying in general?  I have done some brief searching
> of the Galaxy mailing list for server crashes and did not find anything
> suggesting this is a common problem.
> 4) I am not 100% confident that the traceback above is what killed the
> server process.  Does anybody know of a specific literal string I can
> search for to identify when a server process actually dies?  I have done
> some basic review of the log data (our Galaxy server generates lots of
> logs), and "Traceback" does not appear to uniquely identify a server
> crash (tracebacks occur too frequently).  I currently have logging
> configured at DEBUG.
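
Re question 4: I don't know of a single log string that reliably marks a
dying server process.  Rather than grepping the logs, it may be easier to
poll the ports from cron and flag whichever one stops answering.  An
untested sketch (ports and addresses taken from your Apache config above):

    #!/bin/sh
    # Report any Galaxy web process that is no longer answering.
    for port in 8080 8081 8082 8083 8084 8085; do
        if ! curl -s -o /dev/null --max-time 5 http://127.0.0.1:$port/; then
            echo "Galaxy server on port $port appears to be down"
        fi
    done
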
>
> In case this is relevant, I am using the following changeset for Galaxy:
> > hg parents
> changeset:   9320:47ddf167c9f1
> branch:      stable
> tag:         tip
> user:        Nate Coraor <n...@bx.psu.edu>
> date:        Wed May 01 09:50:31 2013 -0400
> summary:     Use Galaxy's ErrorMiddleware since Paste's doesn't return
> start_response.  Fixes downloading tarballs from the Tool Shed when
> use_debug = false.
> >
>
> I appreciate the time you took to read my email, and any expertise you
> could offer in helping me troubleshoot this issue.
>
> Dan Sullivan
>