On 26/06/13 07:11 AM, Lars Arvestad wrote:
> Hi,
>
> I am involved in the genome project for the spruce Picea abies, which you may 
> have seen published recently. We are looking at updating our assembly with 
> both new data and new tools.

Okay.

>Since we have passed our first milestone,

Well, congratulations on reaching the milestone!

> it is time to review options for assembly that have appeared after we last 
> committed on a toolset.
>I recently heard from John MacKay that you successfully applied Ray to P. 
>glauca and that tells me that we should add Ray to our list of candidate tools.

Indeed. We have improved the scalability of Ray on the 8 billion read dataset. 
( => https://github.com/sebhtml/SRA056234-Picea-glauca )

I will blog about this shortly (I just returned from a 1-week break).


>I would therefore like to ask for your comments about the experiment and the 
>feasibility for us to use Ray on our P. abies data.

For P. glauca, I used an IBM Blue Gene/Q (the one at SciNet, in Toronto, Canada).

How many reads do you have, and how many machines do you have (or have access 
to)?

>
> John showed me a table that indicated that Ray gave more contigs and a lower 
> N50, but delivered a longer contig than ABySS could produce.

Indeed:

Job= SRA056234-Picea-glauca-2013-05-13-5

This was with a k-mer length of 95, and 4096 MPI ranks.
4096 may sound like a lot, but the processors of an IBM Blue Gene/Q don't 
have out-of-order execution and their clock frequency is lower than that of
AMD or Intel processors.
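
For readers comparing the numbers: N50 is the length of the contig at which the 
running sum of contig lengths, sorted from longest to shortest, first reaches 
half of the total assembly size. The sketch below uses made-up toy lengths (not 
the spruce results) to show how one assembly can have more contigs and a lower 
N50 while still containing the longest contig. The function name n50 is just 
for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy data only, chosen so both assemblies total 300 bases.
assembly_a = [90] + [10] * 21   # many contigs, one long one
assembly_b = [60] * 5           # few contigs, all medium-sized

for name, contigs in (("A", assembly_a), ("B", assembly_b)):
    print(name, "contigs:", len(contigs),
          "N50:", n50(contigs), "longest:", max(contigs))
```

Assembly A has more contigs and a lower N50 than B, yet its longest contig is 
longer, which is the same pattern as in the comparison discussed here.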

The assembly was done with checkpoints, with a total run time of about 3 days I 
think.

Also, with these large k-mers, we recently found another thing to improve in 
Ray's algorithms (it will improve contiguity and lower the running
time of the seed extension).

   see => https://github.com/sebhtml/ray/issues/188


A lot of the time spent on this white spruce data went into adding (and 
debugging) parallel I/O to write the contigs in parallel.

>  Have you
>  performed other assembly comparisons as well?

Not really, I mostly just compared numbers with ABySS's assembly.
I did other assemblies with a shorter k-mer length (31), but these were not 
really good.

Shaun Jackman suggested a larger k-mer length.

Shaun and his colleagues published a nice paper about how they did their assembly: 
http://bioinformatics.oxfordjournals.org/content/29/12/1492.full

>
> Could you please comment on the resource usage needed for Ray on this large 
> genome?

Well, you will surely need a Bloom filter (in software).

Ray has that. ABySS should have that too IMHO.
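
To illustrate why a Bloom filter helps with memory: a common use in assemblers 
is to keep k-mers seen only once (mostly sequencing errors) out of the main 
table. The sketch below is a generic illustration of that idea, not Ray's 
actual code. The names BloomFilter and count_repeated_kmers, the filter size 
and the number of hash functions are all assumptions, and reverse complements 
are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions (illustrative only)."""

    def __init__(self, bits, hashes):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray((bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with a seed.
        for seed in range(self.hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8,
                salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, item):
        """Insert item; return True if it was possibly already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.array[byte] & (1 << bit):
                seen = False
                self.array[byte] |= 1 << bit
        return seen


def kmers(read, k):
    """Yield every substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_repeated_kmers(reads, k, bits=1 << 20, hashes=3):
    """Store a k-mer in the table only from its second sighting on.

    A false positive in the filter can make a k-mer seen once look
    like it was seen twice; that is the price of the memory saving.
    """
    bloom = BloomFilter(bits, hashes)
    table = {}
    for read in reads:
        for kmer in kmers(read, k):
            if bloom.add(kmer):
                table[kmer] = table.get(kmer, 1) + 1
    return table


print(count_repeated_kmers(["ACGTAC", "ACGTTT"], k=4))
```

Only the k-mer shared by both toy reads ends up in the table; the ones seen 
once cost a few bits in the filter instead of a full table entry.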

For the white spruce, I used 4096 MPI ranks, 1024 nodes, and the hardware 
collectively provided
16 TiB of RAM (DDR3 I think).

Most MPI ranks used around 700 MiB, but a few of them grew their virtual 
memory up to 2-3 GiB.
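
As a back-of-the-envelope check on these figures, the arithmetic can be written 
out as below. It assumes the ranks and the RAM are spread evenly over the 
nodes, which is an assumption made for the illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

ranks = 4096
nodes = 1024
total_ram = 16 * TIB

# Assumption: ranks and memory are distributed evenly across nodes.
ranks_per_node = ranks // nodes
ram_per_node = total_ram // nodes
ram_per_rank = ram_per_node // ranks_per_node

# Aggregate footprint if every rank used the typical 700 MiB.
typical_total = ranks * 700 * MIB

print("ranks per node:", ranks_per_node)
print("RAM per node (GiB):", ram_per_node / GIB)
print("RAM available per rank (GiB):", ram_per_rank / GIB)
print("typical aggregate usage (TiB):", round(typical_total / TIB, 2))
```

Under that assumption each rank has 4 GiB available, so even the few ranks 
that reached 2-3 GiB fit, and the typical aggregate usage is well below the 
16 TiB total.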


Ray will run nicely at this scale with a good interconnect (like a Cray XE6, a 
Blue Gene/Q, or an IBM iDataPlex).


I hope I answered your questions!


==Séb==

>
> Thanks!
> Lars Arvestad
>
>
> --
> Swedish e-Science Research Center
> Science for Life Laboratory
> Dept of Num Analysis and Computer Science
> Stockholm University
>




_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users