Re: [Denovoassembler-users] RAY questions

Sébastien Boisvert Thu, 20 Oct 2011 13:29:16 -0700

On 20/10/11 12:51 PM, Ola Wallerman wrote:
> Hi Sebastien,
>    

Hi Ola,


> I am trying out Ray for assembly of a human genome. I must say I am
> quite surprised by the results from my first try since it worked
> straight away without any problems, with only one program to run,
> which is not what one is used to in the NGS field... I installed
> v1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
> without any errors after ~12 h, with ~1 Gbp assembled.
>
>    

One thing our team is aiming for with Ray is ease of use for the user
(it just works™). The complexity (like the various stages of the
algorithm) is encapsulated in Ray.

Just out of curiosity, what is the inter-node latency of your compute
resource?

Ray tests the network before doing its work, so the latency is in the
file NetworkTest.txt in your assembly directory.

> I wonder if you could give me any advice on how to run it in the best
> way, e.g. should one use as many nodes as possible (we have 384 nodes
> with at least 24 GB) and should reads be quality filtered beforehand?

For the assemblathon, we did not filter reads at all.
With Ray, filtering reads only reduces memory usage, I believe.

If you know you have DNA contamination in your reads (non-human, for
instance), then you should filter reads.

Adaptors used to construct so-called mate pairs through the
circularisation of long DNA molecules may be present if you have mate
pairs. So far, it seems that the optimal read markers in Ray deal with
that.



>
> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
> ~150 to 300 bp) and ~3 billion short single-end reads (~36 bp). I
> tried now with k=27, but perhaps a higher k is better for the long
> reads? The reason for doing the assembly is to use the contigs to get
> a better precision in calling indels and rearrangements.
>
>    

Did you provide all the reads?

You have to be careful with a k that is too large because Ray does not
attempt to correct the reads at all. The erroneous k-mers all go into
an abyss (not the assembler!) and are not really used.
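
To illustrate why a large k is risky without read correction, here is a
minimal Python sketch. It is not code from Ray; the toy read and the
function names are made up for the example. It counts how many k-mers of
a 40 bp read are corrupted by a single substitution.

```python
def kmers(sequence, k):
    """Return all k-mers (substrings of length k) of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def erroneous_kmers(true_read, observed_read, k):
    """Return the k-mers of observed_read that do not occur in true_read."""
    true_set = set(kmers(true_read, k))
    return [kmer for kmer in kmers(observed_read, k) if kmer not in true_set]


# Hypothetical 40 bp read. It deliberately contains no 'T', so any k-mer
# covering the substituted base below is guaranteed to be novel.
true_read = "ACGGACCAGAGCCAGGACGACAGCGGAACCGAGACAGGCA"

# Simulate one sequencing error: substitute the base at position 20 with 'T'.
observed_read = true_read[:20] + "T" + true_read[21:]

for k in (11, 21, 31):
    total = len(kmers(observed_read, k))
    bad = len(erroneous_kmers(true_read, observed_read, k))
    print(f"k={k}: {bad} of {total} k-mers contain the error")
```

With this toy read, one error corrupts 11 of 30 k-mers at k=11, but all
20 k-mers at k=21 and all 10 k-mers at k=31.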

In my experience, k=21 works well for bacteria. For larger genomes, I
usually use k=25 or k=31, although I don't have that much experience
with large genomes aside from the assemblathon.


You can check some files generated by Ray for your first assembly.

In your assembly directory, the file CoverageDistributionAnalysis.txt
contains the peak coverage.
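
For readers who want to see what a "peak coverage" means, here is a
rough Python sketch that locates it in a k-mer coverage histogram. The
histogram values are invented, the layout of Ray's output file is not
assumed, and this simple heuristic is not necessarily how Ray computes
its peak.

```python
def find_valley_and_peak(histogram):
    """Return (valley, peak) coverage values of a k-mer coverage histogram.

    histogram maps a coverage value to the number of distinct k-mers
    observed at that coverage. Low coverages are dominated by erroneous
    k-mers, so counts first fall to a valley and then rise to the peak.
    """
    coverages = sorted(histogram)
    # Walk right from the lowest coverage while the counts keep dropping.
    valley_index = 0
    while (valley_index + 1 < len(coverages)
           and histogram[coverages[valley_index + 1]]
           < histogram[coverages[valley_index]]):
        valley_index += 1
    # The peak is the most frequent coverage at or after the valley.
    peak = max(coverages[valley_index:], key=lambda c: histogram[c])
    return coverages[valley_index], peak


# Hypothetical histogram: many error k-mers at coverage 1-4, genomic
# k-mers centred around coverage 14.
example = {
    1: 900000, 2: 200000, 3: 50000, 4: 20000, 5: 15000, 6: 18000,
    8: 30000, 10: 60000, 12: 90000, 14: 110000, 16: 95000,
    18: 60000, 20: 25000,
}
valley, peak = find_valley_and_peak(example)
print(f"valley at coverage {valley}, peak at coverage {peak}")
```

If the peak sits very close to the valley, the usable k-mer coverage is
low and a smaller k may be worth trying.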

For your paired reads, the file LibraryStatistics.txt contains what Ray
detected in your reads.

This step is very important as paired reads are the workhorse to go
from reads to k-mer graph to seeds to extensions.

The importance of pairs is also highlighted by the recent application note
published by Illumina using the MiSeq and Ray.

http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf

There is also a file called SeedLengthDistribution.txt. This file
contains the distribution of seed lengths. In Ray, a seed is a region
of the genome that is unique. I like to say that a seed is conceptually
similar to a unitig in overlap-layout-consensus assemblers, although I
suspect there are some differences.
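
The unitig analogy can be sketched with the textbook notion of maximal
unbranched paths in a k-mer (de Bruijn) graph. The Python code below is
only an illustration under simplifying assumptions: it ignores reverse
complements and coverage, the function names are hypothetical, and it is
not Ray's actual seed detection.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a k-mer graph: nodes are k-mers, edges link consecutive k-mers."""
    nodes = set()
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            left, right = read[i:i + k], read[i + 1:i + k + 1]
            successors[left].add(right)
            predecessors[right].add(left)
    return nodes, successors, predecessors


def unbranched_paths(nodes, successors, predecessors):
    """Return the sequences of maximal unbranched paths, sorted by start k-mer."""

    def starts_path(node):
        preds = predecessors.get(node, set())
        if len(preds) != 1:
            return True  # no predecessor, or several paths merge here
        (pred,) = preds
        return len(successors.get(pred, set())) != 1  # predecessor branches

    paths = []
    for node in sorted(nodes):
        if not starts_path(node):
            continue
        sequence, current = node, node
        while True:
            succs = successors.get(current, set())
            if len(succs) != 1:
                break  # dead end or branch: the unique region stops here
            (following,) = succs
            if len(predecessors.get(following, set())) != 1:
                break  # another path merges into the next k-mer
            sequence += following[-1]
            current = following
        paths.append(sequence)
    return paths


# Two toy reads sharing the 3-mer CGT, which acts like a tiny repeat.
toy_reads = ["AACGTG", "TCCGTA"]
print(unbranched_paths(*build_graph(toy_reads, 3)))
```

In this toy graph the shared k-mer CGT breaks both reads into separate
unbranched pieces, which is why repeats shorter than k no longer split
paths once k is increased.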


Increasing k increases the uniqueness of sub-sequences extracted
from reads, but also reduces the usable sub-sequence coverage because
each sequencing error spoils every k-mer that spans it. As I said above,
you can assess your sub-sequence coverage (also known as k-mer coverage)
by reading the content of the file CoverageDistributionAnalysis.txt.
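
The trade-off can be put into numbers with two standard approximations:
a read of length L yields L - k + 1 k-mers, and, assuming independent
errors at rate e, a k-mer is error-free with probability (1 - e)^k. The
30x base coverage and 1% error rate below are assumed values for
illustration only, not measurements from this dataset.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base coverage and read length."""
    if k > read_length:
        return 0.0  # reads shorter than k contribute no k-mers
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer has no error, assuming independent errors."""
    return (1.0 - error_rate) ** k


BASE_COVERAGE = 30.0   # assumed, for illustration
ERROR_RATE = 0.01      # assumed per-base error rate

for read_length in (100, 36):
    for k in (21, 27, 31):
        usable = (kmer_coverage(BASE_COVERAGE, read_length, k)
                  * error_free_fraction(ERROR_RATE, k))
        print(f"read length {read_length}, k={k}: "
              f"usable k-mer coverage ~ {usable:.1f}")
```

Under these assumptions, 36 bp reads lose most of their coverage at
k=31 (only 6 k-mers per read), whereas 100 bp reads are affected much
less.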

> Best regards,
>
>    

Let me know if you have any other questions.

> Ola
>
>
>
>    

Sébastien

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Ola Wallerman, PhD
> IGP, Uppsala Universitet
>
> waller...@gmail.com
> olawallerman@skype
> 0736400172
>
>    

