Dear colleagues, 

I am working on the assembly of a plant chromosome (about 750 Mbp long) with
many repetitive regions.

I have a dataset of 617 Mi 2x100 bp Illumina read pairs (1,234 Mi single
reads) and 1 Mi 454/Roche reads with an average length of 510 bp.
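
For reference, here is a rough sketch of the sequencing depth these numbers
imply. It assumes that "Mi" means million reads and that every read comes from
this chromosome; the function and variable names are only illustrative.

```python
# Rough depth-of-coverage estimate from the read counts quoted above.
# Assumptions: "Mi" = 1e6 reads, and all reads originate from the
# 750 Mbp chromosome (no contamination, no organellar reads).

def coverage(num_reads, mean_read_length_bp, target_size_bp):
    """Return the average depth: total sequenced bases / target size."""
    return num_reads * mean_read_length_bp / target_size_bp

CHROMOSOME_BP = 750_000_000

# 617 million pairs -> 1,234 million single reads of 100 bp each.
illumina_depth = coverage(2 * 617_000_000, 100, CHROMOSOME_BP)

# 1 million 454/Roche reads, 510 bp on average.
roche454_depth = coverage(1_000_000, 510, CHROMOSOME_BP)

print(f"Illumina depth:  {illumina_depth:.1f}x")
print(f"454/Roche depth: {roche454_depth:.2f}x")
```

Under those assumptions the Illumina data come to roughly 165x and the
454/Roche data to under 1x.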

Because this is my first time working with this amount of data and because our
in-house cluster is currently unavailable, I will need to rent some Amazon
EC2 resources. I've never tried this before and I'm not sure where to start.
I've read the "Out Damned Spot Instance! Out I say!" post, but I am still lost.
However, I did read something interesting in the post:

"a user can share their machine images with pre-installed software". 

Then my questions are: 

- would a shared machine image work for me? 

* if the answer is "yes", do you have a machine image that you could share for
running Ray?
* if the answer is "no", do you have or know of any tutorial on how to deploy
the appropriate machine (or cluster?) to run Ray for my project?

- what would you recommend for my project: 
a) a complete de novo assembly using both Illumina and 454/Roche reads, or b) a
two-step assembly, first Illumina, then 454/Roche together with the Illumina
output? If the answer is b), is this possible using Ray?

- do you know how much RAM would be needed to perform this assembly? or
do you know how I can estimate it?

- have you ever tried something similar in Amazon EC2? could you give me a cost
and/or time estimate?

- I would like the processes to finish as fast as possible while spending as
little money as possible: do you have any recommendations on how many cores I
should rent, how much memory the server should have (or how much memory is
needed per process), etc.?

- I read in Ray's user manual that for very large jobs (is this project a very
large job?) routing should be enabled unless using a good interconnection: how
can I check if the rented machine is using a good interconnection? Moreover, is
there a trade-off between using lots of processes with routing enabled and
using fewer processes without routing? e.g. would using 56 processes without
routing be as fast as using 64 processes with routing enabled?

I would really appreciate any help.

Sorry for the many questions. 

Thank you all very much in advance. 

Best regards, 

Santiago 
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users