Hello Ryan,

Ryan Golhar wrote, On 04/07/2011 05:40 PM:
> Hi all - So, I've been asked to provide specs for a production Galaxy 
> system to support approximately 20-30 users.  Most of these users
> are new to bioinformatics and very new to NGS.  I'm targeting a user
> base that will use a light to moderate amount of NGS data.

I've found that users "new to NGS" cause the most damage :) 
running bowtie/tophat/bwa over and over again - and even "moderate" amounts of 
NGS data can become taxing very quickly (grooming alone will see to that :) ).

> I've looked at the Production Server Wiki page, but I'm 
> curious what everyone else is using or recommends? How big of a 
> compute cluster, how much storage, proxy/web server configurations, 
> etc, etc.

We're using a 16-core, 32GB RAM, ~15TB storage server and it's good for most 
"regular" Galaxy operations, but severely lacking for NGS mapping.
We are moving to a 48-core, 128GB RAM, 34TB storage server and hope it'll be 
somewhat better (still not enough for heavy NGS usage).
We're also running some jobs on an SGE cluster.

----
Storage: NGS data sizes grow way faster than storage sizes, so planning is 
hard.

What would you call a "moderate amount" of NGS data ?

Let's say a single Illumina lane is our unit of choice.
A paired-end run with 72-cycles yielding 35M reads (reasonable in today's 
terms) gives ~15GB per lane.
A paired-end run with 100-cycles on a HiSeq machine will hopefully yield 200M 
reads (in the near future?) - each lane will be ~100GB.
Those numbers are uncompressed FASTQ files (galaxy can't handle compressed data 
at the moment).
Of course your users could be doing just single-end 36-cycles - but don't plan 
for the best-case scenario.
With sequencing costs dropping rapidly, your users will do more sequencing than 
you expect.
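To make those lane sizes concrete, here's a back-of-envelope Python sketch. The 
per-read overhead constant is my own rough guess for the id line, '+' line and 
newlines - measure your own runs before trusting it:

```python
def fastq_size_bytes(reads, read_length, paired=True, overhead=60):
    """Estimate uncompressed FASTQ size for one lane.

    Each read occupies 4 lines: @id, sequence, '+', qualities.
    'overhead' is a rough allowance for the id line, '+' line and newlines.
    """
    bytes_per_read = 2 * read_length + overhead  # sequence + qualities + overhead
    ends = 2 if paired else 1
    return reads * ends * bytes_per_read

# 72-cycle paired-end lane with 35M reads:
gb = fastq_size_bytes(35_000_000, 72) / 1e9
print(f"{gb:.1f} GB")  # lands in the ~15 GB ballpark quoted above
```

Plugging in 200M reads at 100 cycles gives ~100GB, matching the HiSeq estimate.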

Now, take the size of one lane of data (15GB, 100GB, whatever), and look at 
your expected galaxy pipeline:
1. upload the files (size*1)
2. groom the files (argggg. size*1)
3. map with something, get an unsorted SAM file (size*3 to size*5, depending on 
mapping parameters)
4. Convert to BAM (size*1, luckily it's compressed)
5. use those aligned reads for annotation or similar (size*1)
etc. etc.
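Summing those multipliers gives a per-lane storage budget. A quick sketch (the 
x4 for unsorted SAM is just the midpoint of the x3-x5 range above):

```python
# Per-step storage multipliers for a typical Galaxy NGS pipeline,
# expressed as multiples of the input lane size.
PIPELINE_STEPS = {
    "uploaded FASTQ":     1.0,
    "groomed FASTQ":      1.0,
    "unsorted SAM":       4.0,  # anywhere from x3 to x5, depending on parameters
    "BAM":                1.0,  # luckily compressed
    "downstream results": 1.0,
}

def storage_per_lane(lane_gb):
    """Map each pipeline step to its estimated on-disk size in GB."""
    return {step: factor * lane_gb for step, factor in PIPELINE_STEPS.items()}

lane_gb = 15
total = sum(storage_per_lane(lane_gb).values())
print(f"~{total:.0f} GB of files per {lane_gb} GB lane")  # 8x the lane size
```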

If you use Galaxy's library management, then you'll want to keep all your FASTQ 
files somehow available for Galaxy at all times - meaning more storage.

The only way we're able to keep storage at 15TB is by aggressively deleting 
users' datasets (as Hans wrote).
Plan for at least "lane size"*10 of temporary storage per lane, times the 
number of lanes you're going to handle at once before you start deleting files 
- and probably even that won't be enough :(
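That rule of thumb translates into a simple capacity check (the 10x safety 
factor is the one suggested above, not a universal constant):

```python
def lanes_before_deleting(storage_tb, lane_gb, safety_factor=10):
    """How many lanes fit in storage before you must start deleting,
    reserving safety_factor * lane size per lane in flight."""
    return int(storage_tb * 1000 // (lane_gb * safety_factor))

print(lanes_before_deleting(15, 100))  # 15 HiSeq-sized lanes in 15TB
print(lanes_before_deleting(15, 15))   # 100 of the smaller 15GB lanes
```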

-----
Processes:

The server processes you should plan for are:
1 galaxy process for job-runner
2 or 3 galaxy processes for web-fronts
1 process of postgres
1 process of apache
optionally 1 process of galaxy-reports
you'll also want to leave some free CPUs for SSH access, CRON jobs and other 
peripherals.
Postgres & Apache are multithreaded, but Galaxy usually puts only a light load 
on the web/DB front (even with 30 users), so it balances out.
So all in all, I'd recommend reserving 5 to 8 CPU cores to just galaxy and 
daemons (reserving means: never using those cores for galaxy jobs).
You can do with less cores, but then response times might suffer (and it's 
annoying when you click "show saved histories" and the page takes 20 seconds to 
load...).
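The core budget above can be sketched as simple arithmetic (the per-process 
counts are the ones suggested in this email; adjust for your own daemons):

```python
# Cores reserved for Galaxy itself and supporting daemons -
# never scheduled for Galaxy jobs.
RESERVED = {
    "galaxy job-runner": 1,
    "galaxy web-fronts": 3,
    "postgres":          1,
    "apache":            1,
    "galaxy-reports":    1,
    "ssh/cron/misc":     1,
}

def cores_for_jobs(total_cores):
    """Cores left over for actual Galaxy jobs after reservations."""
    return total_cores - sum(RESERVED.values())

print(cores_for_jobs(48))  # 40 cores left for mapping jobs
```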

If other people are using different calculations, I'd be more than happy to 
hear them.

------
Galaxy jobs:

Compared with NGS-related jobs (i.e. mapping), most Galaxy jobs are very simple 
and light (even if they take a long time to complete).

Plan by estimating how much time a common pipeline takes for a single lane:
Let's say I have a 48-core server, with 8 cores reserved for Galaxy/daemons.
That leaves me with 40 cores.
I can plan for 1 galaxy job with 40 threads (bowtie/tophat/bwa/etc.), or 2 jobs 
with 20 threads, or 4 jobs with 10 threads, etc.
How much time do you want your users to wait for their jobs to complete ?
With only 10 threads per job, 4 users can run jobs at the same time (but each 
job takes longer).
With 20 threads, only 2 users can run jobs at the same time (but hopefully each 
job finishes faster).
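Here's a toy Python model of that trade-off. The linear-scaling assumption is 
idealized - real mappers scale sub-linearly with threads - and the 80-hour 
single-thread baseline is purely illustrative:

```python
def concurrency(total_cores, threads_per_job):
    """How many jobs can run at once on a given core budget."""
    return total_cores // threads_per_job

def job_hours(baseline_hours_1thread, threads_per_job):
    """Idealized runtime assuming perfect linear scaling with threads."""
    return baseline_hours_1thread / threads_per_job

cores = 40  # 48-core box minus 8 reserved cores
for threads in (40, 20, 10):
    print(f"{threads:2d} threads/job: "
          f"{concurrency(cores, threads)} concurrent jobs, "
          f"~{job_hours(80, threads):.0f} h per job")
```

More threads per job means faster individual jobs but a longer queue for 
everyone else - exactly the balancing act described below.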

As Hans wrote, most of the time just a few users are actively using your galaxy 
server at any single point in time.
But if two users started a mapping job that will take 4 hours to complete, and 
three hours later another user wants to start a mapping job - he'll have to 
wait...

So it's a balancing act between providing users fast results, and keeping the 
costs down (with fewer cores/nodes).

I don't know of a good textbook answer here; it depends on what your users are 
willing to accept:
If you get a whole flowcell (7 or 8 lanes, one for each user), the first 
two-to-four users can run jobs immediately and the rest will have to wait ~5 
hours - is that acceptable? If not, get more nodes.

If you buy more than one node (i.e. not just one machine with 48 cores), I'd 
recommend going for fewer nodes with many cores (as opposed to many nodes with 
just 4 cores). Most common tools today seem to make better use of threads 
(requiring SMP) than of MPI or similar non-shared-memory processing.


Comments are always welcome,
 -gordon
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/
