On Aug 2, 2013, at 1:06 PM, Thon de Boer wrote:

> I did some more investigation of this issue
>  
> I do notice that my 4-core, 8-slot VM has a load average of 32, with only my 
> 4 handler processes running (plus my web server), yet none of them is getting 
> more than 10% of a CPU.
> There seems to be some process in my handlers that is taking an incredible 
> amount of resources, even though top is not showing it (shown below).
>  
> Does anyone have any idea how to figure out where the bottleneck is?
> Is there perhaps a way to turn on more detailed logging to see what each 
> process is doing?
>  
> My IT guy suggested there may be some “context switching” going on due to the 
> many threads that are running (I use a thread pool of 7 for each server), but 
> I'm not sure how to address that issue…

Hi Thon,

It looks like it's probably the memory use - if you restart the Galaxy 
processes, do you see any change?
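
If you want to confirm that, something along these lines (untested; adjust the 
user if yours differs) should list the Galaxy processes sorted by resident memory:

  ps -u svcgalax -o pid,rss,vsz,etime,args --sort=-rss | head

With only ~170MB free and web0 at 2.7GB resident in your top output, the box 
may simply be running short on memory.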

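As for more detailed logging: paster can pick up a standard Python logging 
configuration ([loggers]/[handlers]/[formatters] sections) from 
universe_wsgi.ini, so adding one with the root logger set to DEBUG should give 
a better picture of what each handler is doing.
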
--nate

>  
> Anyone?
>  
> top - 10:00:53 up 37 days, 19:29,  8 users,  load average: 32.10, 32.10, 32.09
> Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
> Cpu(s):  4.8%us,  2.5%sy,  0.0%ni, 92.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:  16334504k total, 16164084k used,   170420k free,   127720k buffers
> Swap:  4194296k total,    15228k used,  4179068k free,  2460252k cached
>  
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 7190 svcgalax  20   0 2721m 284m 5976 S  9.9  1.8 142:53.84 python 
> ./scripts/paster.py serve universe_wsgi.ini --server-name=handler3 
> --pid-file=handler3.pid --log-file=handler3.log --daemon
> 7183 svcgalax  20   0 2720m 286m 5984 S  6.4  1.8 135:52.63 python 
> ./scripts/paster.py serve universe_wsgi.ini --server-name=handler2 
> --pid-file=handler2.pid --log-file=handler2.log --daemon
> 7175 svcgalax  20   0 2720m 287m 5976 S  5.6  1.8 117:59.40 python 
> ./scripts/paster.py serve universe_wsgi.ini --server-name=handler1 
> --pid-file=handler1.pid --log-file=handler1.log --daemon
> 7166 svcgalax  20   0 3442m 2.7g 4884 S  4.6 17.5  74:31.66 python 
> ./scripts/paster.py serve universe_wsgi.ini --server-name=web0 
> --pid-file=web0.pid --log-file=web0.log --daemon
> 7172 svcgalax  20   0 2720m 294m 5984 S  4.0  1.8 133:17.19 python 
> ./scripts/paster.py serve universe_wsgi.ini --server-name=handler0 
> --pid-file=handler0.pid --log-file=handler0.log --daemon
> 1564 root      20   0  291m  13m 7552 S  0.3  0.1   1:49.65 /usr/sbin/httpd
> 7890 svcgalax  20   0 17216 1456 1036 S  0.3  0.0   2:15.73 top
> 10682 apache    20   0  297m  11m 3516 S  0.3  0.1   0:02.23 /usr/sbin/httpd
> 11224 apache    20   0  295m  11m 3236 S  0.3  0.1   0:00.29 /usr/sbin/httpd
> 11263 svcgalax  20   0 17248 1460 1036 R  0.3  0.0   0:00.06 top
>     1 root      20   0 21320 1040  784 S  0.0  0.0   0:00.95 /sbin/init
>     2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 [kthreadd]
>     3 root      RT   0     0    0    0 S  0.0  0.0   0:06.35 [migration/0]
>  
> Regards,
>  
> Thon
>  
> Thon deBoer Ph.D., Bioinformatics Guru 
> California, USA |p: +1 (650) 799-6839  |m:  thondeb...@me.com
>  
> From: galaxy-dev-boun...@lists.bx.psu.edu 
> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Thon Deboer
> Sent: Wednesday, July 17, 2013 11:31 PM
> To: galaxy-dev@lists.bx.psu.edu
> Subject: [galaxy-dev] Jobs remain in queue until restart
>  
> Hi,
>  
> I have noticed that from time to time the job queue seems to be “stuck” and 
> can only be unstuck by restarting Galaxy.
> The jobs seem to be stuck in the queued state, the Python job handler processes 
> are barely ticking over, and the cluster is empty.
>  
> When I restart, the startup procedure notices that all jobs are in the “new” 
> state and then assigns a job handler to each, after which the jobs start fine…
>  
> Any ideas?
>  
>  
> Thon
>  
> P.S. I am using the June release of Galaxy and I DO set limits on my users in 
> job_conf.xml, as shown below. (Maybe it is related? Before it went into this 
> dormant mode, this user had started lots of jobs and may have hit the limit, 
> but I assumed this limit applied to the number of running jobs at one time, 
> right?)
>  
> <?xml version="1.0"?>
> <job_conf>
>     <plugins workers="4">
>         <!-- "workers" is the number of threads for the runner's work queue.
>              The default from <plugins> is used if not defined for a <plugin>.
>           -->
>         <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="2"/>
>         <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="8"/>
>         <plugin id="cli" type="runner" load="galaxy.jobs.runners.cli:ShellJobRunner" workers="2"/>
>     </plugins>
>     <handlers default="handlers">
>         <!-- Additional job handlers - the id should match the name of a
>              [server:<id>] in universe_wsgi.ini.
>          -->
>         <handler id="handler0" tags="handlers"/>
>         <handler id="handler1" tags="handlers"/>
>         <handler id="handler2" tags="handlers"/>
>         <handler id="handler3" tags="handlers"/>
>         <!-- <handler id="handler10" tags="handlers"/>
>         <handler id="handler11" tags="handlers"/>
>         <handler id="handler12" tags="handlers"/>
>         <handler id="handler13" tags="handlers"/>
>         -->
>     </handlers>
>     <destinations default="regularjobs">
>         <!-- Destinations define details about remote resources and how jobs
>              should be executed on those remote resources.
>          -->
>         <destination id="local" runner="local"/>
>         <destination id="regularjobs" runner="drmaa" tags="cluster">
>             <!-- These are the parameters for qsub, such as queue etc. -->
>             <param id="nativeSpecification">-V -q long.q -pe smp 1</param>
>         </destination>
>         <destination id="longjobs" runner="drmaa" tags="cluster,long_jobs">
>             <!-- These are the parameters for qsub, such as queue etc. -->
>             <param id="nativeSpecification">-V -q long.q -pe smp 1</param>
>         </destination>
>         <destination id="shortjobs" runner="drmaa" tags="cluster,short_jobs">
>             <!-- These are the parameters for qsub, such as queue etc. -->
>             <param id="nativeSpecification">-V -q short.q -pe smp 1</param>
>         </destination>
>         <destination id="multicorejobs4" runner="drmaa" tags="cluster,multicore_jobs">
>             <!-- These are the parameters for qsub, such as queue etc. -->
>             <param id="nativeSpecification">-V -q long.q -pe smp 4</param>
>         </destination>
>  
>         <!-- <destination id="real_user_cluster" runner="drmaa">
>             <param id="galaxy_external_runjob_script">scripts/drmaa_external_runner.py</param>
>             <param id="galaxy_external_killjob_script">scripts/drmaa_external_killer.py</param>
>             <param id="galaxy_external_chown_script">scripts/external_chown_script.py</param>
>         </destination> -->
>  
>         <destination id="dynamic" runner="dynamic">
>             <!-- A destination that represents a method in the dynamic runner. -->
>             <param id="type">python</param>
>             <param id="function">interactiveOrCluster</param>
>         </destination>
>     </destinations>
>     <tools>
>         <!-- Tools can be configured to use specific destinations or handlers,
>              identified by either the "id" or "tags" attribute.  If assigned to
>              a tag, a handler or destination that matches that tag will be
>              chosen at random.
>          -->
>         <tool id="bwa_wrapper" destination="multicorejobs4"/>
>     </tools>
>     <limits>
>         <!-- Certain limits can be defined.
>         <limit type="registered_user_concurrent_jobs">500</limit>
>         <limit type="unregistered_user_concurrent_jobs">1</limit>
>         <limit type="concurrent_jobs" id="local">1</limit>
>         <limit type="concurrent_jobs" tag="cluster">200</limit>
>         <limit type="concurrent_jobs" tag="long_jobs">200</limit>
>         <limit type="concurrent_jobs" tag="short_jobs">200</limit>
>         <limit type="concurrent_jobs" tag="multicore_jobs">100</limit>
>         -->
>     </limits>
> </job_conf>


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
