On Mar 13, 2012, at 6:59 AM, David Matthews wrote:
> We emailed previously about possible memory leaks in our installation of
> Galaxy here on the HPC at Bristol. We can run Galaxy just fine on our login
> node but when we integrate into the cluster using pbs job runner the whole
> thing falls over - almost certainly due to a memory leak. In essence, every
> attempt to submit a TopHat job (with 2x5GB paired end reads to the full human
> genome) always results in the whole thing falling over - but not when Galaxy
> is restricted to the login node.
> We saw that Nate responded to Todd Oakley about a week ago saying that there
> is a memory leak in libtorque or pbs_python when using the pbs job runner.
> Have there been any developments on this?
> Best Wishes,
I am almost certain that the problem you are seeing with TopHat is not due to
the same leak, since that one is a slow leak, not an immediate spike. Before we
go any further, in reading back over our past conversation about this problem,
I noticed that I never asked whether you've set `set_metadata_externally = True`
in your Galaxy config. If not, this is almost certainly the cause of the
problem.
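For reference, that is a one-line change in Galaxy's main config file
(universe_wsgi.ini in installs of this era; the section name below assumes a
stock layout):

```ini
[app:main]
# Write dataset metadata from an external process on the cluster node
# rather than inside the long-lived Galaxy server process.
set_metadata_externally = True
```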
If you're already setting metadata externally, then answers to a few of the
questions I asked last time (or any findings from your HPC team), plus a few
new things to try, would be helpful in figuring out why your TopHat jobs are
bringing the server down:
1. Create a separate job runner and web frontend so we can be sure that the job
running portion is the memory culprit:
You would not need any of the load balancing config; just start a single web
process and a single runner process. From reading your prior email I believe
you have a proxy server, so as long as you start the web process on the same
port as your previous Galaxy server, no change would be needed to your proxy
configuration.
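As a sketch of what that split might look like in universe_wsgi.ini (the
section names, ports, and the job-tracking option are assumptions based on the
Paste-based layout of this era; check them against your Galaxy version):

```ini
# Two Paste server sections; start each process with
#   sh run.sh --server-name=web0      (proxied, serves the UI)
#   sh run.sh --server-name=runner0   (never proxied, only runs jobs)
[server:web0]
use = egg:Paste#http
port = 8080    # the port your proxy already points at

[server:runner0]
use = egg:Paste#http
port = 8090

[app:main]
# Let the two processes coordinate job state through the database.
track_jobs_in_database = True
```

Then watching the two processes in top (or whatever your HPC guys prefer) will
show which one balloons when a TopHat job finishes.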
2. Set `use_heartbeat = True` in the config file of whichever process is
consuming all of the memory.
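For reference, the heartbeat thread periodically dumps a trace of every Galaxy
thread, which should show what the bloated process is doing as memory grows;
enabling it is again a one-line change in the same config file:

```ini
[app:main]
# Periodically write a trace of all threads to a heartbeat log file.
use_heartbeat = True
```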
3. Does the MemoryError appear in the log after Galaxy has noticed that the job
has finished on the cluster (`(<id>/<pbs id>) PBS job has left queue`), but
before the job post-processing is finished (`job <id> ended`)?
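To check that ordering quickly, you can pull the three relevant messages for a
single job out of the server log and compare their positions. A sketch, with
the log path, the exact message formats, and job id 123 all stand-ins for your
real values (the fabricated log below just demonstrates the expected order):

```python
# Point LOG at your real server log and set JOB_ID to the failing
# job's Galaxy id; the file written here is only a demonstration.
LOG = "sample_paster.log"
JOB_ID = "123"

with open(LOG, "w") as f:
    f.write(
        "galaxy.jobs.runners.pbs DEBUG (123/456.pbs) PBS job has left queue\n"
        "galaxy.jobs ERROR MemoryError\n"
        "galaxy.jobs DEBUG job 123 ended\n"
    )

# Record the first line number on which each marker appears.
markers = [f"({JOB_ID}/", "MemoryError", f"job {JOB_ID} ended"]
positions = {}
with open(LOG) as f:
    for lineno, line in enumerate(f):
        for m in markers:
            if m in line and m not in positions:
                positions[m] = lineno

# The question above is whether MemoryError falls between the other two.
print(positions)
```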
4. Does the MemoryError appear regardless of whether anyone accesses the web
interface?
There is another memory consumption problem we'll look at soon, which occurs
when the job runner reads the metadata files written by the external
set_metadata tool. If the output dataset(s) have an extremely large number of
columns, this can cause a very large, nearly immediate memory spike when job
post-processing begins, even if the output file itself is relatively small.
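As a rough illustration of why that spike is so out of proportion to the file
size (hypothetical numbers and a deliberately naive loader; this is not
Galaxy's actual metadata code):

```python
# Per-column metadata kept as Python objects costs far more memory
# than the bytes describing those columns on disk.
import sys

n_columns = 100_000  # e.g. one column per sample in a very wide output

# On disk: a few bytes per column, so well under a megabyte of metadata.
on_disk_bytes = n_columns * 4

# In memory: one small dict per column (name + type), as a naive
# metadata loader might build while reading the file.
columns = [{"name": f"c{i}", "type": "str"} for i in range(n_columns)]
in_memory_bytes = sys.getsizeof(columns) + sum(
    sys.getsizeof(c) for c in columns
)

print(f"on disk:   ~{on_disk_bytes / 1e6:.1f} MB")
print(f"in memory: ~{in_memory_bytes / 1e6:.1f} MB")
```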
> Dr David A. Matthews
> Senior Lecturer in Virology
> Room E49
> Department of Cellular and Molecular Medicine,
> School of Medical Sciences
> University Walk,
> University of Bristol
> BS8 1TD
> Tel. +44 117 3312058
> Fax. +44 117 3312091
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at: