On 18/12/13 02:50 PM, Santiago Revale wrote:
> Dear colleagues,

Hi,

> I am working on the assembly of a plant chromosome (about 750 Mbp long) with
> lots of repeated regions.
>
> I have a dataset of 617 Mi 2x100 bp Illumina paired-end reads (1,234 Mi single
> reads) and 1 Mi 454/Roche reads averaging 510 bp.
>
> Because this is my first time working with this amount of data and because
> our in-house cluster is unavailable at the moment, I will need to rent some
> Amazon EC2 resources. I've never tried this before and I'm not sure where to
> start. I've read the "Out Damned Spot Instance! Out I say!" post, but I am
> still lost. Still, I read something interesting in the post:
>
> "a user can share their machine images with pre-installed software".
>
> Then my questions are:
>
> - would a shared machine image work for me?
>
>   * If the answer is "yes", do you have a machine that you could share to run
>     with Ray?
>   * If the answer is "no", do you have/know any tutorial on how to deploy
>     the appropriate machine (or cluster?) to run Ray for my project?

I know that some folks are using MIT StarCluster to drive and control their
compute nodes.

http://star.mit.edu/cluster/

Aside from that, I can say that you should look for "Placement Groups" on
Amazon EC2 to reduce latency.

> - what would you recommend for my project:
> a) a complete de novo assembly using both Illumina and 454/Roche reads, or b)
> a two-step assembly, first Illumina then 454/Roche with Illumina output?

I don't recommend using contigs as input. Ray will perform better with reads
directly because it can better utilize coverage information that way.

Furthermore, long-read support in Ray is not as good as its support for short
reads (like Illumina's).

> If the answer is b), is this possible using Ray?
>
> - do you know how much RAM memory would be needed to perform this assembly?
> or do you know how I can estimate it?

It is tricky to estimate because the amount depends on:

1. number of reads;
2. number of nucleotides in the target genome;
3. error rate (errors consume memory roughly linearly with the number of
   reads).

> - have you ever tried something similar in Amazon EC2?

I did mostly bacterial genomes on Amazon EC2 (either via MIT StarCluster, via
a home-made setup, or using the DNAnexus scientific frontend for de novo genome
assembly with Ray).

> could you give me a cost and/or time estimation?
>
> - I would like the processes to finish as fast as possible while spending as
> little money as possible: do you have any recommendations on how many cores
> I should rent, how much memory the server should have (or how much memory is
> needed per process), etc.?
>
> - I read in Ray's user manual that for very large jobs (is this project a very
> large job?) routing should be enabled unless using a good interconnection:

That won't change anything in the cloud, I think, depending obviously on the
scale of your endeavour.

> how can I check if the rented machine is using a good interconnection?

Last time I checked, cc2.8xlarge spot instances with a placement group were
good value for the money.

> Moreover, is there a trade-off between using lots of processes with routing
> enabled and using fewer processes but without routing? e.g. would using 56
> processes without routing be as fast as using 64 processes with routing
> enabled?

In order to lay out any meaningful estimate beforehand, you need to mention
your target genome size, your number of reads, your read length, and
technology.

> I will really appreciate any help.
>
> Sorry for the many questions.
>
> Thank you all very much in advance.
>
> Best regards,
>
> Santiago
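As a rough aid for the inputs such an estimate needs, here is a minimal Python sketch that turns the figures from the question into sequencing depth. It assumes "Mi" means 10^6 reads (it could also mean 2^20), and the names `coverage` and `GENOME_SIZE_BP` are made up for illustration. It says nothing about how much memory Ray will actually use; it only shows how much data there is relative to the genome.

```python
# Back-of-the-envelope sequencing depth from the numbers quoted in this thread.
# Assumption: "Mi" is read as 10**6 reads (it may instead mean 2**20).

GENOME_SIZE_BP = 750_000_000  # "about 750 Mbp"


def coverage(num_reads, mean_read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Return expected depth: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length_bp / genome_size_bp


# 617 M pairs of 100 bp reads, i.e. 1,234 M single reads
illumina_depth = coverage(1_234_000_000, 100)
# 1 M 454/Roche reads, 510 bp on average
roche_depth = coverage(1_000_000, 510)

print(f"Illumina:  {illumina_depth:.1f}x")
print(f"454/Roche: {roche_depth:.2f}x")
```

Under these assumptions the Illumina data alone comes to roughly 164x, while the 454/Roche reads add less than 1x.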
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users