[galaxy-dev] Local Galaxy concept system: hardware spec questions

Sebastian Schaaf Mon, 13 Aug 2012 03:21:06 -0700

Hi all,

I have a couple of questions around the topic "hardware requirements" for a server which is intended to be bought and used as a concept machine for NGS-related jobs. It should be used for development of tools and workflows (using Galaxy, sure) as well as a platform for some "alpha" users, who should learn to work on NGS data, which they have just begun to generate.

This concept phase is planned to last 1-2 years. During this time main memory and especially storage could be extended, the latter on a per-project basis. We will start with a small team of 3 people for supporting and developing Galaxy and the system according to the users' requirements, and the first group of users will bring in data, scientific questions and hands-on work on their own data. The main tasks (regarding system load) will be sequence alignment (BLAST, mapping tools like BWA/Bowtie), and after that maybe some experimental sequence clustering/de novo assembly for exome data. Additionally, variant detection in whatever form is targeted. Only active projects will be stored locally; data no longer in use will be stored elsewhere in the network.

So much for the setting. Regarding the specs, the following is intended:


- dual-CPU mainboard
- 256 GB RAM
- 20-30 TB HDD @ RAID6 (data)
- SSDs @ RAID5 (system, tmp)

Due to funding limitations it may be the case that RAM has to be decreased to 128 GB. Also still unresolved is the question of whether the funding will be enough for the SSD bundle in RAID5; maybe we have to go for only two of them in RAID1.
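To make the storage trade-off concrete, here is a minimal sketch of usable capacity per RAID level. The disk counts and sizes in the example calls are hypothetical (none are fixed yet), and filesystem overhead and hot spares are ignored.

```python
def usable_capacity_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity of one RAID set, ignoring filesystem overhead.

    RAID1 mirrors all disks, RAID5 spends one disk on parity,
    RAID6 spends two disks on parity.
    """
    minimum = {"raid1": 2, "raid5": 3, "raid6": 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"{level} needs at least {minimum[level]} disks")
    if level == "raid1":
        return disk_tb
    parity_disks = 1 if level == "raid5" else 2
    return (disks - parity_disks) * disk_tb


if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    print("data RAID6, 12 x 3 TB HDD:", usable_capacity_tb("raid6", 12, 3.0), "TB")
    print("SSD RAID5, 4 x 0.5 TB:", usable_capacity_tb("raid5", 4, 0.5), "TB")
    print("SSD RAID1, 2 x 0.5 TB:", usable_capacity_tb("raid1", 2, 0.5), "TB")
```

With these assumed sizes, going from four SSDs in RAID5 to two in RAID1 cuts usable system/tmp space to a third, which matters if tmp has to hold intermediate alignment files.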

What we are trying to find out is where in the described tasks the machine would run into bottlenecks. What's pretty clear is that I/O is everything, already from a theoretical point of view. But we also observed that on a comparable machine (2x 3.33 GHz Intel 6-core, 100 GB RAM, 450 MB/s R/W to data RAID6).

The question of questions comes right at the beginning of configuring a system: whether one should go for an AMD or an Intel architecture system. The first offers more cores (8-12) at a lower frequency (~2.4 GHz), the latter fewer cores (6) at a higher frequency (~3.3 GHz). According to the data sheets, the Intel CPUs are on a per-core basis ~30% faster with integer operations and ~50% faster with floating point. The risk we see with the AMDs is on the one hand that the number of cores per socket could saturate the memory controller, and on the other hand that those jobs which cannot, or only poorly, be parallelized need more time.
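The "more slow cores vs. fewer fast cores" trade-off can be sketched with Amdahl's law. This is only a rough model: the per-core speed factor of 1.3 is taken from the ~30% integer figure above, the core counts are the per-socket numbers, and memory-controller saturation and I/O are deliberately ignored. The function name and the parallel fractions are chosen for illustration.

```python
def relative_runtime(parallel_fraction: float, cores: int, per_core_speed: float) -> float:
    """Runtime relative to one core of speed 1.0, following Amdahl's law.

    The serial part runs on one core, the parallel part is spread
    evenly over all cores; both scale with per-core speed.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = 1.0 - parallel_fraction
    return (serial + parallel_fraction / cores) / per_core_speed


if __name__ == "__main__":
    # Assumed: AMD 12 cores at baseline speed, Intel 6 cores ~30% faster each.
    for p in (0.5, 0.8, 0.9, 0.95, 1.0):
        amd = relative_runtime(p, cores=12, per_core_speed=1.0)
        intel = relative_runtime(p, cores=6, per_core_speed=1.3)
        faster = "Intel" if intel < amd else "AMD"
        print(f"parallel fraction {p:.2f}: AMD {amd:.3f}, Intel {intel:.3f} -> {faster}")
```

Under these assumptions the fewer, faster cores win for jobs that are less than roughly 84% parallelizable, and the many-core option wins above that, provided the memory controller keeps up.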

To boil all this down to some distinct questions (don't feel forced to answer all of them):

1. Using the described bioinformatics software: where are the potential system bottlenecks? (connections between CPUs, RAM, HDDs)

2. What is the expected ratio of integer-based to floating-point-based calculations loading the CPU cores?

3. Regarding the architectural differences (strengths, weaknesses): Would an AMD or an Intel system be more suitable?

4. How much I/O (read and write) can be expected at the memory controllers? Which tasks are most I/O intensive (regarding RAM and/or HDDs)?

5. Roughly separated into mapping and clustering jobs: how much main memory can a single job be expected to require (given e.g. Illumina exome data, 50x coverage)? As far as I know mapping should be around 4 GB, clustering much more (may reach high double digits).

6. HDD access (R/W) is mainly in bigger blocks instead of masses of short operations - correct?
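
For question 5, the per-job figures translate into a simple estimate of how many jobs fit into RAM at once. This is a sketch under stated assumptions: the 16 GB reserved for OS and Galaxy itself and the 64 GB per clustering job (standing in for "high double digits") are guesses, not measurements.

```python
def concurrent_jobs(total_ram_gb: float, per_job_gb: float, reserved_gb: float = 16.0) -> int:
    """Number of jobs that fit into RAM simultaneously.

    reserved_gb is an assumed allowance for the OS, Galaxy and page cache.
    """
    usable = total_ram_gb - reserved_gb
    if usable <= 0 or per_job_gb <= 0:
        return 0
    return int(usable // per_job_gb)


if __name__ == "__main__":
    for ram in (256, 128):
        mapping = concurrent_jobs(ram, per_job_gb=4)      # ~4 GB per mapping job
        clustering = concurrent_jobs(ram, per_job_gb=64)  # assumed 64 GB per job
        print(f"{ram} GB RAM: {mapping} mapping jobs or {clustering} clustering jobs")
```

With these numbers, memory would limit clustering/assembly jobs long before it limits mapping, where the core count is the tighter constraint.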

All those questions are a bit rough and improvised (yes, it IS a bit chaotic currently - sorry for that), but any clue to a single question would help. "Unfortunately" we got the money to place the order for our own hardware unexpectedly quickly, and we are now forced to act. We want to make as few cardinal errors as possible...


Thanks a lot in advance,

Sebastian



--
Sebastian Schaaf, M.Sc. Bioinformatics
Chair of Biometry and Bioinformatics
Department of Medical Information Sciences, Biometry and Epidemiology
University of Munich

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/
