Hello,
[Please CC the user mailing list]
On 03/05/2013 06:50 PM, Lo, Chien-Chi wrote:
> Hi Sébastien,
>
> I have a question about how Ray weighs paired-end reads. I ran an
> E. coli MiSeq paired-end data set through Ray 2.1.0 twice. First, I
> treated all reads as single-end; second, I ran Ray in paired-end mode.
> The commands I used are pasted below. The number of resulting contigs is
> very different. Why does paired-end mode generate a better result?
Let me tell you a short story.
With your DNA reads, Ray builds a graph. The graph is distributed uniformly
across all the MPI ranks.
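
To make the idea concrete, here is a minimal single-process sketch of a k-mer
graph whose vertices are spread over ranks. It is not Ray's code: the function
names, the CRC32 hash, and the omission of reverse complements are all
assumptions made for illustration, and each "rank" is just a Python dict.

```python
import zlib

def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def owner_rank(kmer, n_ranks):
    """Pick the rank that stores a k-mer (hypothetical hash-based placement)."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks

def build_distributed_graph(reads, k, n_ranks):
    """Return one adjacency map per rank: k-mer -> set of successor k-mers."""
    ranks = [dict() for _ in range(n_ranks)]
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            # Register the vertex on the rank that owns it.
            ranks[owner_rank(kmer, n_ranks)].setdefault(kmer, set())
            if previous is not None:
                # The edge is stored with its source vertex.
                ranks[owner_rank(previous, n_ranks)][previous].add(kmer)
            previous = kmer
    return ranks
```

Hashing each k-mer to a rank is one simple way to get a roughly uniform
spread: no rank needs to know the whole graph, only which rank owns a given
vertex.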
Until recently, it was impossible to see such a graph with energy.
Here's such a graph for Escherichia coli DH10B sequenced on an Illumina(R)
MiSeq(R) 2x250 and assembled with Ray using a k-mer length of 91.
http://browser.cloud.raytrek.com/client/?map=0&section=3&region=4&location=178944
Once this graph is built, the Ray MPI ranks begin an extraordinary journey in
which they collectively, as a tribe, perform parallel graph traversals using
heuristics. This journey is extraordinary because of the sheer number of
messages passed between MPI ranks -- the computation granularity is on the
order of 10-70 microseconds.
The heuristics in Ray use paired reads, mate pairs, and single-end read
threading to perform the graph traversals. So Ray uses paired information for
more than just scaffolding. Some other assemblers, such as ABySS, do this too.
These original Ray heuristics are described in this open-access publication:
http://online.liebertpub.com/doi/abs/10.1089/cmb.2009.0238
Recently, our group has generalized these heuristics and other parts of the
algorithms to handle mixes of genomes. The new algorithms work well on
bacterial genomes, on metagenomes, and also on transcriptomes. However, the
algorithms probably do not handle alternative splicing very well.
These generalizations are described in this open-access publication:
http://genomebiology.com/2012/13/12/R122/abstract
> Thanks!
>
> For single-ended,
> Contigs >= 500 nt
> Number: 583
> Total length: 4357532
> Average: 7474
> N50: 10494
> Median: 5905
> Largest: 38443
> For paired-ended,
> Contigs >= 500 nt
> Number: 87
> Total length: 4603370
> Average: 52912
> N50: 106053
> Median: 35487
> Largest: 269165
>
>
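
For anyone comparing the two summaries above: N50 is the contig length at
which the contigs of that length or longer cover at least half of the total
assembly length. A minimal sketch, with a function name of my own choosing
(this is not taken from Ray's source):

```python
def n50(lengths):
    """Return the length L such that contigs >= L cover half the total."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

A higher N50 with a similar total length, as in the paired-end run, means the
same genome is covered by fewer, longer contigs.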
Also, at the moment, the algorithms perform better with pairs of reads than
with single-end reads because they match the outer distances to the empirical
distribution of signal for each library.
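
As a rough illustration of that matching, the sketch below checks whether the
distance a pair would span along a candidate path is plausible for its
library. It is a simplification under stated assumptions: the empirical
distribution is collapsed to a mean and standard deviation, the cutoff of 3
deviations is arbitrary, and the function names are hypothetical rather than
Ray's.

```python
from statistics import mean, stdev

def library_model(observed_distances):
    """Summarise a library by the mean and spread of its outer distances."""
    return mean(observed_distances), stdev(observed_distances)

def pair_supports_path(distance_on_path, model, max_deviations=3.0):
    """True if the distance implied by a path is plausible for the library."""
    mu, sigma = model
    return abs(distance_on_path - mu) <= max_deviations * sigma
```

With single-end reads there is no such distance to test, so a traversal has
less evidence for choosing among branches at a repeat.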
We devised some other neat algorithms on pairs, such as this:
Constrained traversal of repeats with paired sequences.
Sébastien Boisvert, Élénie Godzaridis, François Laviolette & Jacques
Corbeil.
First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing,
March 26-27 2011, Vancouver, BC, Canada.
abstract: http://boisvert.info/publications/RECOMB-seq-2011-abstract.html
presentation: http://www.boisvert.info/dropbox/recomb-seq-2011-talk.pdf
The read recycling in Ray actually is something that is not implemented in
other assemblers, as far as I know.
I fixed a very rare bug in the read recycling code yesterday --
https://github.com/sebhtml/ray/commit/354c02bc7f3e963fb22809c3a5176e5f8d6cba26
For long reads, the matching algorithms are not designed to handle insertions
very well, except where any insertion can be matched with a deletion (and
vice versa).
>
>
>
>
>
> ### Command 1 ####
> mpiexec -n 16 Ray \
> -k \
> 31 \
> -s \
> MiSeq_Ecoli_MG1655_110721_PF_R1.fastq \
> -s \
> MiSeq_Ecoli_MG1655_110721_PF_R2.fastq \
> -o \
> Ray_single
> ####################
>
>
Did you know that Ray natively accepts compressed files, such as .fastq.bz2
or .fastq.gz? You just need to compile with HAVE_LIBZ=y (for .gz) and/or
HAVE_LIBBZ2=y (for .bz2).
> ### Command 2 ####
> mpiexec -n 16 Ray \
> -k \
> 31 \
> -p \
> MiSeq_Ecoli_MG1655_110721_PF_R1.fastq \
> MiSeq_Ecoli_MG1655_110721_PF_R2.fastq \
> -o \
> Ray_paired
> ##################
>
>
>
That's really a nice test. It was really easy for me to understand what you did
just by reading your message.
>
>
>
> Chien-Chi Lo
> Research Technologist
> Los Alamos National Laboratory
>