On 26/06/13 07:11 AM, Lars Arvestad wrote:
> Hi,
>
> I am involved in the genome project for the spruce Picea abies, which you may
> have seen published recently. We are looking at updating our assembly with
> both new data and new tools.
Okay.

> Since we have passed our first milestone,

Well, congratulations on the milestone achievement!

> it is time to review options for assembly that have appeared after we last
> committed on a toolset.
> I recently heard from John MacKay that you successfully applied Ray to P.
> glauca and that tells me that we should add Ray to our list of candidate tools.

Indeed. We have improved the scalability of Ray on the 8 billion read dataset.
( => https://github.com/sebhtml/SRA056234-Picea-glauca )

I will blog about this shortly (I just returned from a 1-week break).

> I would therefore like to ask for your comments about the experiment and the
> feasibility for us to use Ray on our P. abies data.

For P. glauca, I used an IBM Blue Gene/Q (the one at SciNet, in Toronto, Canada).

How many reads do you have, and how many machines do you have (or have access to)?

> John showed me a table that indicated that Ray gave more contigs and a lower
> N50, but delivered a longer contig than ABySS could produce.

Indeed:

Job= SRA056234-Picea-glauca-2013-05-13-5

This was with a k-mer length of 95, and 4096 MPI ranks. 4096 may sound like a lot, but the processors of an IBM Blue Gene/Q don't have out-of-order execution and their frequency is lower than AMD's or Intel's.

The assembly was done with checkpoints, with a total run time of about 3 days I think.

Also, with these large k-mers, we recently found another thing to improve in Ray's algorithms (it will improve contiguity and lower the running time of the seed extension).

see => https://github.com/sebhtml/ray/issues/188

A lot of the time spent on this white spruce data went into adding (and debugging) parallel I/O to write the contigs in parallel.

> Have you
> performed other assembly comparisons as well?

Not really, I mostly just compared numbers with ABySS's assembly. I did other assemblies with a shorter k-mer length (31), but these were not really good.

Shaun Jackman suggested a larger k-mer length.
They published a nice paper about how they did it and everything:
http://bioinformatics.oxfordjournals.org/content/29/12/1492.full

> Could you please comment on the resource usage needed for Ray on this large
> genome?

Well, surely you need a Bloom filter (in software). Ray has that. ABySS should have that too IMHO.

For the white spruce, I used 4096 MPI ranks on 1024 nodes, and the hardware collectively provided 16 TiB of RAM (DDR3 I think).

Most MPI ranks used around 700 MiB, but a few of them grew their virtual memory up to 2-3 GiB.

Ray will run nicely at this scale with a good interconnect (like a Cray XE6, a Blue Gene/Q, or an IBM iDataPlex).

I hope I answered your questions!

==Séb==

> Thanks!
> Lars Arvestad
>
> --
> Swedish e-Science Research Center
> Science for Life Laboratory
> Dept of Num Analysis and Computer Science
> Stockholm University

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
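
As a rough sanity check of the resource figures quoted in the reply above (4096 MPI ranks, 1024 nodes, 16 TiB of RAM, about 700 MiB per typical rank, 2-3 GiB for a few outliers), the arithmetic can be sketched as follows. This is only an illustration: the function names are made up for this note, and it assumes that RAM and ranks are spread evenly across the nodes, which the message does not state.

```python
# Figures quoted in the message for the white spruce (P. glauca) run.
MPI_RANKS = 4096
NODES = 1024
TOTAL_RAM_TIB = 16
TYPICAL_RANK_MIB = 700   # "most MPI ranks used around 700 MiB"
PEAK_RANK_GIB = 3        # upper end of the "2-3 GiB" outliers

MIB_PER_GIB = 1024
GIB_PER_TIB = 1024


def ranks_per_node(ranks: int, nodes: int) -> int:
    """Ranks placed on each node, assuming an even distribution."""
    return ranks // nodes


def ram_per_rank_gib(total_tib: float, ranks: int) -> float:
    """RAM available to each rank, assuming RAM is shared evenly."""
    return total_tib * GIB_PER_TIB / ranks


def typical_total_tib(rank_mib: float, ranks: int) -> float:
    """Aggregate memory if every rank used the typical amount."""
    return rank_mib * ranks / MIB_PER_GIB / GIB_PER_TIB


if __name__ == "__main__":
    per_rank = ram_per_rank_gib(TOTAL_RAM_TIB, MPI_RANKS)
    print("ranks per node:", ranks_per_node(MPI_RANKS, NODES))
    print("RAM available per rank (GiB):", per_rank)
    print("typical aggregate use (TiB):",
          typical_total_tib(TYPICAL_RANK_MIB, MPI_RANKS))
    print("peak rank fits in its share:", PEAK_RANK_GIB <= per_rank)
```

Under these assumptions each rank has 4 GiB available, so even the 2-3 GiB outliers fit, and typical usage adds up to well under the 16 TiB total.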