Re: [Denovoassembler-users] Paired-end and single-end reads

Sébastien Boisvert Wed, 06 Mar 2013 09:33:03 -0800

Hello,

[Please CC the user mailing list]

On 03/05/2013 06:50 PM, Lo, Chien-Chi wrote:
> Hi Sébastien,
>
> I have a question about how Ray weights paired-end reads. I ran
> E. coli MiSeq paired-end data through Ray 2.1.0 twice. First, I treated
> all reads as single-end; second, I ran Ray in paired-end mode. The
> commands I used are pasted below. The resulting numbers of contigs are
> very different. Why does paired-end mode generate a better result?

Let me tell you a short story.

With your DNA reads, Ray builds a graph. The graph is distributed uniformly
across all the MPI ranks.
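
To make the idea concrete, here is a minimal Python sketch, not Ray's actual
code. It cuts reads into k-mers, links consecutive k-mers as edges, and assigns
each k-mer vertex to a rank by hashing it. The function names, the toy reads,
the tiny k, and the CRC32 hash are all assumptions made for the example; it
also ignores reverse complements and coverage.

```python
import zlib

def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def owner_rank(kmer, n_ranks):
    """Map a k-mer to one rank (hypothetical CRC32 hash, chosen for the example)."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks

def build_distributed_graph(reads, k, n_ranks):
    """Return one dict per rank: vertex k-mer -> set of successor k-mers."""
    ranks = [dict() for _ in range(n_ranks)]
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            # The vertex lives on whichever rank its hash selects.
            ranks[owner_rank(kmer, n_ranks)].setdefault(kmer, set())
            if previous is not None:
                # The edge is stored with its source vertex.
                ranks[owner_rank(previous, n_ranks)][previous].add(kmer)
            previous = kmer
    return ranks

# Toy input: two overlapping reads, k = 3, four ranks.
graph = build_distributed_graph(["ACGTAC", "CGTACG"], k=3, n_ranks=4)
for rank, vertices in enumerate(graph):
    print(rank, sorted(vertices))
```

Because a good hash spreads k-mers roughly evenly, each rank ends up holding a
similar share of the vertices, and following an edge usually means sending a
message to another rank.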

Until recently, it was impossible to see such a graph with energy.

Here's such a graph for Escherichia coli DH10B sequenced on an Illumina(R)
MiSeq(R) 2x250 and assembled with Ray using a k-mer length of 91.

http://browser.cloud.raytrek.com/client/?map=0&section=3&region=4&location=178944

Once this graph is built, the Ray MPI ranks begin an extraordinary journey in
which they collectively, as a tribe, perform parallel graph traversals using
heuristics. This journey is extraordinary because of the sheer number of
messages passed between MPI ranks -- the computation granularity is on the
order of 10-70 microseconds.

The heuristics in Ray use paired reads, mate pairs, and single-end read
threading to perform the graph traversals. So Ray uses pairing information for
more than just scaffolding. This is also the case with some other assemblers,
such as ABySS.

These original Ray heuristics are described in this open-access publication:

     http://online.liebertpub.com/doi/abs/10.1089/cmb.2009.0238

Recently, our group has generalized these heuristics and other parts of the
algorithms to handle mixes of genomes. The new algorithms work well on
bacterial genomes, on metagenomes, and also on transcriptomes. However, the
algorithms probably do not handle alternative splicing very well.

These generalizations are described in this open-access publication:

     http://genomebiology.com/2012/13/12/R122/abstract

> Thanks!
>
> For single-ended,
>    Contigs >= 500 nt
>    Number: 583
>    Total length: 4357532
>    Average: 7474
>    N50: 10494
>    Median: 5905
>    Largest: 38443
> For paired-ended,
>    Contigs >= 500 nt
>    Number: 87
>    Total length: 4603370
>    Average: 52912
>    N50: 106053
>    Median: 35487
>    Largest: 269165
>
>
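
For readers unfamiliar with the N50 figure in the two summaries above, here is
a small Python sketch of the usual definition: the contig length at which the
contigs of that length or longer hold at least half of the total assembled
length. The function name and the toy lengths are made up for illustration,
and Ray's own reporting code may differ in details.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (usual textbook definition)."""
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")

# Toy example with made-up contig lengths.
print(n50([2, 3, 4, 5, 6]))
```

The "Average" lines above are simply the total length divided by the number of
contigs, for example 4357532 / 583 for the single-end run.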

Also, at the moment, the algorithms are better with pairs of reads than with
single-end reads because the algorithms will match the outer distances to the
empirical distribution of signal for each library.
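
As a rough illustration of that matching step (a sketch under my own
simplifying assumptions, not the code in Ray), suppose each library is
summarized by the mean and standard deviation of its observed outer distances,
and a candidate path is supported by a pair only when the distance it implies
falls close to that mean. The function names, the tolerance of three standard
deviations, and the sample distances are all hypothetical.

```python
from statistics import mean, stdev

def library_profile(observed_distances):
    """Summarize one library's empirical outer distances as (mean, std dev)."""
    return mean(observed_distances), stdev(observed_distances)

def pair_supports_path(candidate_distance, profile, tolerance=3.0):
    """True if the distance implied by a candidate path fits the library."""
    average, deviation = profile
    return abs(candidate_distance - average) <= tolerance * deviation

# Made-up outer distances observed for one paired library.
profile = library_profile([480, 500, 520, 495, 505])

# Two candidate branches at a repeat: one places the mates 510 nt apart,
# the other 900 nt apart. Only the first agrees with the library.
print(pair_supports_path(510, profile))
print(pair_supports_path(900, profile))
```

Single-end reads carry no such distance constraint, which is one reason they
give the traversal less to work with at repeats.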

We devised some other neat algorithms on pairs, such as this:

     Constrained traversal of repeats with paired sequences.
     Sébastien Boisvert, Élénie Godzaridis, François Laviolette & Jacques Corbeil.
     First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing,
     March 26-27 2011, Vancouver, BC, Canada.
     abstract: http://boisvert.info/publications/RECOMB-seq-2011-abstract.html
     presentation: http://www.boisvert.info/dropbox/recomb-seq-2011-talk.pdf

As far as I know, the read recycling in Ray is not implemented in other
assemblers.

I fixed a very rare bug in the read recycling code yesterday -- 
https://github.com/sebhtml/ray/commit/354c02bc7f3e963fb22809c3a5176e5f8d6cba26

For long reads, the matching algorithms are not designed to handle insertions
very well, except in the case where any insertion can be mated with a deletion
(and vice versa).

>
>
>
>
>
> ### Command 1 ####
> mpiexec -n 16 Ray \
>   -k \
>   31 \
>   -s \
>   MiSeq_Ecoli_MG1655_110721_PF_R1.fastq \
>   -s \
>   MiSeq_Ecoli_MG1655_110721_PF_R2.fastq \
>   -o \
>   Ray_single
> ####################
>
>

Did you know that Ray natively accepts compressed files, such as .fastq.bz2
or .fastq.gz?

You just need to compile with HAVE_LIBZ=y (for .gz) and/or with HAVE_LIBBZ2=y
(for .bz2).

> ### Command 2 ####
> mpiexec -n 16 Ray \
>   -k \
>   31 \
>   -p \
>   MiSeq_Ecoli_MG1655_110721_PF_R1.fastq \
>   MiSeq_Ecoli_MG1655_110721_PF_R2.fastq \
>   -o \
>   Ray_paired
> ##################
>
>
>

That's really a nice test. It was really easy for me to understand what you did
just by reading your message.

>
>
>
> Chien-Chi Lo
> Research Technologist
> Los Alamos National Laboratory
>

