Hello Dan,

A couple of lessons we learned from setting up similar workshop galaxies:

Dan Sullivan wrote, On 09/17/2012 01:04 PM:
> Hi, Galaxy Developers,
> 
> Is anybody out there managing a Galaxy environment that was designed and or 
> has been tested to support 35 concurrent users?  The reason why I am asking 
> this is because we [the U of C] have a training session coming up this 
> Thursday, and the environment we have deployed needs to support this number 
> of users.  We have put the server under as high as stress as possible with 6 
> users, and Galaxy has performed fine, however it has proven somewhat 
> challenging to do load testing for all 35 concurrent users prior to the 
> workshop.  I can't help but feel we are rolling the dice a little bit as 
> we've never put the server under anything close to this load level, so I 
> figured I would try to dot my i's by sending an email to this list.
> 
> Here are the configuration changes that are currently implemented (in terms 
> of trying to performance tune and web scale our galaxy server):
> 
> 1) Enabled proxy load balancing with six web front-ends (the number six 
> pulled from Galaxy wiki) (Apache):
> 
When configured correctly, 3 or 4 web-front-ends seemed sufficient.
(When configured incorrectly, it doesn't matter how many you have - performance 
will suffer :) ).

Given that you only have 4 CPUs/cores on your machine, having six front-ends 
seems like too much.
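
Just for reference, a minimal balancer block for three front-ends looks roughly 
like this (the ports here are assumptions - they have to match the [server:web*] 
sections in your universe_wsgi.ini):

   <Proxy balancer://galaxy>
       BalancerMember http://localhost:8080
       BalancerMember http://localhost:8081
       BalancerMember http://localhost:8082
   </Proxy>

   RewriteEngine on
   RewriteRule ^(.*) balancer://galaxy$1 [P]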

> 2) Rewrite static URLs for static content (Apache):
> 3) Enabled compression and caching (Apache):

This might sound obvious, but test that it actually works (e.g. check in the 
apache logs that the static files were served by apache, not by galaxy).
Typos and other minor incompatibilities can cause these URLs to be served by 
galaxy, which will waste resources.
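
For comparison, the static rewrites from the Galaxy Apache proxy documentation 
look roughly like this (the galaxy-dist path is an assumption; they also need to 
appear *before* the balancer catch-all rule, so /static/ requests never reach 
the python processes):

   RewriteRule ^/static/style/(.*) /home/galaxy/galaxy-dist/static/june_2007_style/blue/$1 [L]
   RewriteRule ^/static/scripts/(.*) /home/galaxy/galaxy-dist/static/scripts/packed/$1 [L]
   RewriteRule ^/static/(.*) /home/galaxy/galaxy-dist/static/$1 [L]
   RewriteRule ^/favicon.ico /home/galaxy/galaxy-dist/static/favicon.ico [L]
   RewriteRule ^/robots.txt /home/galaxy/galaxy-dist/static/robots.txt [L]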

> 4) Configured web scaling (universe_wsgi.ini) :
>       a) six web server processes (threadpool_workers = 7)
>       b) a single job manager (threadpool_workers = 5)
>         c) two job handlers (threadpool_workers = 5)

Again, with a system of only 4 CPUs, you might overload your server.
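
For what it's worth, a scaled-down layout (3 web processes, 1 manager, 1 handler) 
would look roughly like this in universe_wsgi.ini - the section names and ports 
are just a sketch of the standard web-scaling setup, adjust them to your install:

   [server:web0]
   use = egg:Paste#http
   port = 8080
   host = 127.0.0.1
   use_threadpool = true
   threadpool_workers = 7

   # web1 and web2 are identical, on ports 8081 and 8082

   [server:manager]
   use = egg:Paste#http
   port = 8079
   host = 127.0.0.1
   use_threadpool = true
   threadpool_workers = 5

   [server:handler0]
   use = egg:Paste#http
   port = 8090
   host = 127.0.0.1
   use_threadpool = true
   threadpool_workers = 5

   # and in [app:main]:
   job_manager = manager
   job_handlers = handler0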

> 5) Configured a pbs_mom external job runner (our cluster), and commented out 
> the default tool runners (to use pbs)  (we are not using the other tools for 
> the workshop).
> 
> #ucsc_table_direct1 = local:///
> #ucsc_table_direct_archaea1 = local:///
> #ucsc_table_direct_test1 = local:///
> #upload1 = local:///
> 

Unless your workshop is *tightly* scripted, you can't really tell which tools 
users will use.
If this is an introduction to galaxy, users will experiment with some tools 
(even if you don't tell them to).

(Also, I'm not sure those data import tools can run on your cluster nodes.)
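
If you want to keep them usable, a sketch of an alternative (assuming the usual 
[galaxy:tool_runners] section and a PBS default runner) is to pin the upload and 
data-source tools to the local runner instead of commenting them out:

   # in [app:main]:
   default_cluster_job_runner = pbs:///

   [galaxy:tool_runners]
   upload1 = local:///
   ucsc_table_direct1 = local:///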


> 6)  Changed the following database parameters (universe_wsgi.ini):
>       database_engine_option_pool_size = 10
>       database_engine_option_max_overflow = 20 

Assuming you're using PostgreSQL (and in practice you shouldn't use anything 
else), add the following:
   database_engine_option_server_side_cursors = True

And I would set "pool_size" to 50 and "max_overflow" to 100 - it sounds 
excessive, but under the load of 20 users hammering at galaxy at the same time 
in a short time window, I got "database connection pool size" errors within 10 
minutes.
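
So the relevant universe_wsgi.ini block would end up looking something like this 
(the connection string is just a placeholder):

   database_connection = postgres://galaxy:PASSWORD@localhost:5432/galaxy
   database_engine_option_pool_size = 50
   database_engine_option_max_overflow = 100
   database_engine_option_server_side_cursors = True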
 
> The server I have is a VM with the following resources:
> 
> 2GB of RAM
> 4CPU Cores
> 

IMHO, that's too little memory and too few CPUs.

Some ball-park figures from our servers:

Memory-wise:
each web-front-end python process takes ~300MB (and you plan for 6 of them),
and you also have 3 more python processes (1 job manager + 2 job handlers).
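
A quick back-of-envelope sum on the memory side:

   6 web processes x ~300MB = ~1.8 GB
   3 job processes x ~300MB = ~0.9 GB
                      total = ~2.7 GB

which is already above your 2GB before PostgreSQL, Apache and the OS take 
their share.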

CPU-wise:
in addition to 9 python processes, you will have several PostgreSQL processes, 
a few apache threads, and some other system processes running.
Even when each python process doesn't run at full capacity (i.e. 100% CPU), your 
system already sounds overloaded.
When jobs are running (at least on our system), the job-handlers consume some 
CPU time just by monitoring the jobs.
When users submit large workflows with many jobs, the job-handlers take 100% 
CPU for a short time.
With all of the above combined, I would say 4 CPUs sounds a bit weak.


> I feel that it is also worthwhile to mention that users will not be 
> downloading datasets during the workshop, so as of now, the implementation of 
> "XSendFile" as specified in the Apache Proxy documentation is not of 
> immediate concern.
> 

IMHO, this is a wrong assumption.
You cannot fully control what users in a workshop are doing.
Imagine that just two of your 35 users click (even accidentally) on the download 
icon and start downloading a big file - if downloads are handled by the python 
processes, then immediately two of your six web-front-ends are busy and can't 
serve other users.

Also - regardless of how big the downloaded files are, Apache+XSendFile will be 
more efficient at sending files to the user (than python), and with just 4 CPUs 
you definitely want to conserve as many resources as possible.
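
Enabling it is cheap - roughly something like this (assuming mod_xsendfile is 
installed; where exactly the directives go depends on your proxy setup):

   # universe_wsgi.ini
   apache_xsendfile = True

   # Apache, inside the Galaxy virtual host / Location block
   XSendFile on
   XSendFilePath /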


> Does anybody see any blaring mistakes where they think this configuration 
> might fall short with respect to capacity planning for an environment of 35 
> concurrent users, or additional tuning that could potentially assist in 
> ensuring the availability of the server during the workshop?   Thank-you so 
> much for your opinion(s), and please wish us luck this Thursday :-)
> 

Another important lesson we learned for workshops is to carefully plan each 
example, and especially to measure the upload, download and processing times.

It's hard to give specific details without knowing more about your workshop, 
but generally:

1. Work with small datasets (e.g. if you show-case NGS workflows, take a tiny 
subset of a HiSeq run).

2. Work with small genomes (e.g. yeast). If you must work with bigger genomes, 
work with a single chromosome, and ensure (beforehand) that the input files 
contain reads / intervals that would map to that chromosome and would give 
meaningful output.

3. Rehearse any workflow you are going to present - measure how long it takes 
to submit it, and how long it takes to complete.
Try to submit the same workflow in parallel from 10 different machines (at the 
same time) - see if your server can handle it, and how long it takes to 
complete.
(The reason being: if you plan it wrong, the instructor might tell the users 
to do something, and it can take 20 minutes for all the jobs of all the users 
to complete before the workshop can go on - very frustrating.)
Try a workflow of at least 10 tools (or better, one similar to the workflow 
actually presented in the workshop) - submitting large workflows is somewhat of 
a bottleneck for the galaxy processes (at least on our server).

4. Publish the workflow in a way that users can easily import it (e.g. put the 
URL in a place where users can click and import it) - for users who don't want 
to build it themselves.

5. Publish *the results* of running the workflow on your example input files 
(the actual input files used in the workflow) - this will save *a lot* of time 
for users who don't want to run things, or (embarrassingly) if something goes 
wrong with the server and you must show the results to keep the workshop going 
(speaking from experience).
We even reproduced the exact workflow and history with full results on the 
public Galaxy server, so we could easily divert the users to the public server 
and tell them that these are the results they would get (with an apology that 
our local server couldn't handle their load) - that way they could still 
explore the resulting files (and learn the file formats) when our server 
couldn't handle it.

6. Test "input" methods (how users will get data into your galaxy).
The best way (IMHO) is uploading through a URL - publish your input files 
somewhere, and give users simple URLs they can paste into the "get data" tool.
Uploading from the local computer is error prone (especially with big files), 
and uploading with FTP is confusing and complicated to explain in a workshop 
(imagine telling new users they need to install an FTP program to send 
files... too distracting).
Then, measure how long it takes to upload those files into galaxy, and (if 
possible) how long it takes for 30 users to upload the same file at the same 
time - the instructor must be prepared to stall while files are uploading and 
jobs are running :)

7. Test "output" methods (how users will view the results).
If your tool outputs HTML, make sure your galaxy can display HTML properly 
(look for "sanitize_all_html" in universe_wsgi.ini - see the snippet below).
If the output is BAM/BigWig/VCF/etc, make sure your galaxy can easily send 
tracks to the UCSC Genome Browser (or another browser) - and make sure your 
server can handle the load of sending BAM files to UCSC (which brings up 
XSendFile again).
If you plan on using IGV - better to prepare the users in advance to have java 
working properly.
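
The sanitize_all_html snippet mentioned in item 7 - if I remember correctly the 
default sanitizes tool-generated HTML, so to display it properly you would need 
something like:

   # universe_wsgi.ini - lets tool-generated HTML render un-sanitized
   # (a security trade-off; only reasonable for a controlled workshop setup)
   sanitize_all_html = False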
 



Hope this helps,
 -gordon