Re: [Denovoassembler-users] Ray produced the same scaffolds and contigs

Sébastien Boisvert Tue, 13 Mar 2012 08:04:32 -0700

> 
> As you mentioned that 'the assembly is reliable with Ray without
> quality filtering.'  I am wondering if it applied to other assemblers.
> It would be very nice if you can let me know more about this.
>



Ray will remove errors automatically. A k-mer has to pass several steps
in order to be included in the de Bruijn graph in Ray.

First, it needs to be in the Bloom filter. < 1 % of k-mers will act as if
they are in the Bloom filter when they are not (false positives).

Second, after being in the Bloom filter, a k-mer will learn the ways of
the k-mer at the k-mer academy (KmerAcademy.cpp).

Third, after being at the k-mer academy, a k-mer will be part of the
graph.
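
As a rough illustration of those three steps, here is a minimal Python
sketch. It is not Ray's code: the BloomFilter class, the admit_kmers
function, the hash choice and the min_depth threshold are all assumptions
made for the example, and reverse complements are ignored.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Toy Bloom filter (hypothetical, not Ray's implementation)."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May return True for an item never added (false positive).
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def admit_kmers(reads, k, min_depth=2):
    """Return the k-mers that would reach the graph in this toy model."""
    bloom = BloomFilter()
    academy = Counter()  # stand-in for the k-mer academy
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in bloom:
                academy[kmer] += 1  # seen before, or a false positive
            else:
                bloom.add(kmer)     # first sighting only marks the filter
    # The academy counts sightings after the first one, so depth = count + 1.
    return {kmer for kmer, count in academy.items() if count + 1 >= min_depth}


reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]  # last read carries one error
graph_kmers = admit_kmers(reads, k=5)
print(sorted(graph_kmers))
```

In this toy run the k-mers of the read seen three times are admitted, and
the k-mers that only occur in the read with the error are not.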


For other assemblers, I don't know.


On Sun, 2012-03-11 at 06:51 -0400, Huanle LIU wrote:
> Hi Sébastien,
> 
> 
> I tried without filtering. Here are the basic metrics for the assembly.
> This time there was a small difference between scaffolds and
> contigs. 
> 
> 
> 
> 
> Metric                               Scaffolds     (no label)
> Max contig/scaffold length               37121          37121
> Mean length                             175.02         175.28
> Sd                                      144.46         171.21
> Median                                     135            135
> N50                                        166            166
> Num of contigs/scaffolds               3183628        3181307
> Number of contigs >= 1kb                 15186          14552
> Number of contigs in N50                944002         940389
> Number of bases                      557197106      557626009
> Number of bases in >= 1kb contigs     22034806       23678812
> 
> 
> 
> 
> And this time it produced Library stats for all three libraries. 
> 
> 
> As you mentioned that 'the assembly is reliable with Ray without
> quality filtering.'  I am wondering if it applied to other assemblers.
> It would be very nice if you can let me know more about this.
> 
> 
> Best Regards,
> Huanle 
> 
> 
> On 08/03/2012, at 6:59 AM, Sébastien Boisvert wrote:
> 
> > Ray computes k-mer coverage depth.
> > 
> > Contigs contain only k-mers with high coverage depth.
> > 
> > So yes, the assembly is reliable with Ray without quality filtering.
> > 
> > But feel free to filter your reads and share your experience.
> > 
> > 
> > On Tue, 2012-03-06 at 17:04 -0500, LIU wrote:
> > > One more question.
> > > If there is no quality filtering, do you think the assembly is
> > > reliable?
> > > 
> > > 
> > > 
> > > On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert
> > > <sebastien.boisver...@ulaval.ca> wrote:
> > >        On Mon, 2012-03-05 at 22:11 -0500, LIU wrote:
> > > > Thanks very much for your explanation about the kmer myth.
> > > > 
> > > > 
> > > > The paired reads are not equal in length because I trimmed the reads
> > > > based on quality. Specifically, I trimmed the reads using the
> > > > filtering criterion of 15 consecutive bases having a quality score
> > > > higher than 15 (Phred score). Some paired-end reads were also broken.
> > > > 
> > > 
> > > 
> > > > 
> > > > I found only 31/Library1.txt
> > > > 293 1
> > > > 347 1
> > > > 359 1
> > > > 391 1
> > > > 
> > > 
> > > 
> > >        This shows that Ray sees no paired reads in your data.
> > > 
> > > > 
> > > > I do not know whether I deleted the others, if they were produced.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > The last 10 lines of 31/SeedLengthDistribution.txt are :
> > > > 
> > > > 
> > > > 16892 1
> > > > 17117 1
> > > > 18662 1
> > > > 19763 1
> > > > 21295 1
> > > > 23185 1
> > > > 23416 1
> > > > 25293 1
> > > > 26018 1
> > > > 28186 1
> > > > 
> > > 
> > > 
> > >        That is just fine. Ray uses these long DNA sequences present in
> > >        your sample to estimate insert lengths for paired reads.
> > > 
> > >        However, it seems that Ray is unable to gather enough signal for
> > >        your paired reads.
> > > 
> > >        Can you try without trimming your reads? I sense that the second
> > >        sequence is often shorter than the k-mer length, and a read
> > >        shorter than the k-mer length yields no k-mers, so it cannot be
> > >        used.
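
To make the length argument concrete, here is a small Python sketch. The
function names are made up for the example, and the rule that a pair is
only useful when both mates yield at least one k-mer is an assumption
about how an assembler would use pairs, not a statement about Ray's code.

```python
def kmer_count(read_length, k):
    """Number of k-mers a read of the given length contributes."""
    return max(0, read_length - k + 1)


def pair_is_informative(len_first, len_second, k):
    """Assumed rule: both mates must yield at least one k-mer."""
    return kmer_count(len_first, k) > 0 and kmer_count(len_second, k) > 0


# A 17-base read, like the trimmed one quoted later in this thread,
# contributes nothing at k=31.
print(kmer_count(17, 31))
print(pair_is_informative(17, 70, 31))
```

So heavy trimming can silently turn a paired library into a mostly
single-end one as far as the k-mer graph is concerned.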
> > > 
> > > > 
> > > > Thanks.
> > > > 
> > > > 
> > > > Best Regards,
> > > > Huanle
> > > > 
> > > > 
> > > > On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert
> > > > <sebastien.boisver...@ulaval.ca> wrote:
> > > >        See my responses below.
> > > > 
> > > >        On Mon, 2012-03-05 at 18:53 -0500, LIU wrote:
> > > > > Hi ,
> > > > > 
> > > > > 
> > > > > Thanks for your response.
> > > > > 
> > > > > On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert
> > > > > <sebastien.boisver...@ulaval.ca> wrote:
> > > > >        1. Using a k-mer length of 71 will _presumably_ not work
> > > > >        very well because of sequencing errors. First do a test
> > > > >        run at k=31.
> > > > > Yes, I also ran k=31.
> > > > > It is the same case as k=71.
> > > > > One more question about the choice of k-mer length.
> > > > > I was also told that a longer k-mer is supposed to produce a more
> > > > > accurate assembly, while shorter ones are more prone to sequencing
> > > > > errors.
> > > > > I am confused. Perhaps I should open another ticket to ask this
> > > > > question. But I would really appreciate your answer.
> > > > > 
> > > > 
> > > > 
> > > >        Using longer k-mers makes the k-mers more unique.
> > > > 
> > > >        Let's say that this is a read:
> > > > 
> > > >                                        *
> > > >        TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > > 
> > > >        The '*' marks a sequencing error.
> > > > 
> > > >        For 71-mers, the sliding window is:
> > > > 
> > > >                                        *
> > > >        TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > >        kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> > > > 
> > > >        So basically all the k-mers generated from that sliding window
> > > >        contain the sequencing error.
> > > > 
> > > >        For 31-mers, the sliding window is:
> > > > 
> > > >                                        *
> > > >        TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > >        kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> > > > 
> > > >        So with 31-mers, you will get some erroneous k-mers and some
> > > >        genuine k-mers.
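
The sliding-window argument can be checked with a few lines of Python.
This is only a sketch: the windows_hit function is a name made up here,
and the error index 32 is an assumption, since the exact column of the
'*' cannot be recovered reliably from the quoted text.

```python
READ = ("TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCC"
        "ATCCTTCTTCGTCCTCTTCCCG")
ERROR_INDEX = 32  # assumed 0-based position of the sequencing error


def windows_hit(read_length, k, error_index):
    """Return (total k-mers, k-mers whose window covers error_index)."""
    total = max(0, read_length - k + 1)
    if total == 0:
        return 0, 0
    first_start = max(0, error_index - k + 1)     # leftmost window covering it
    last_start = min(error_index, read_length - k)  # rightmost window covering it
    hit = max(0, last_start - first_start + 1)
    return total, hit


for k in (71, 31):
    total, hit = windows_hit(len(READ), k, ERROR_INDEX)
    print(f"k={k}: {total} k-mers, {hit} contain the error, "
          f"{total - hit} are clean")
```

With the read above and the assumed error index, every 71-mer contains
the error, while a good share of the 31-mers do not.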
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >        2. Are your interleaved files properly generated?
> > > > > 
> > > > >        sequence1/1
> > > > >        sequence1/2
> > > > >        sequence2/1
> > > > >        sequence2/2
> > > > >        sequence3/1
> > > > >        sequence3/2
> > > > > Yes, I think my sequences are correctly interleaved. E.g.,
> > > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1
> > > > > TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG
> > > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2
> > > > > GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT
> > > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1
> > > > > TATATAGATAGATTTCA
> > > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2
> > > > > CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT
> > > > > In fact, I also tried Velvet. It produced different contigs and
> > > > > scaffolds. But of course Ray and Velvet may not be directly
> > > > > comparable because of different scaffolding strategies (I do not
> > > > > know this, it's simply a guess).
> > > > 
> > > > 
> > > >        This looks OK.
> > > > 
> > > >        But why is the second sequence shorter than the first one?
> > > > 
> > > >        Usually, Illumina sequencing produces 2 sequences of the same
> > > >        length for each pair of sequences.
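
For checking the interleaving on a larger file, something like the
following Python sketch could be used. It assumes the mates are named
with /1 and /2 suffixes as in the headers quoted above; the
check_interleaved function is a hypothetical helper, and whether Ray
itself inspects the names (rather than just the order) is not stated in
this thread.

```python
def check_interleaved(names):
    """Return a list of problems found in an interleaved list of read names."""
    problems = []
    if len(names) % 2:
        problems.append("odd number of reads")
    for i in range(0, len(names) - 1, 2):
        first, second = names[i], names[i + 1]
        same_stem = first[:-2] == second[:-2]
        if not (first.endswith("/1") and second.endswith("/2") and same_stem):
            problems.append(
                f"reads {i} and {i + 1} do not form a pair: {first} {second}")
    return problems


names = [
    "@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1",
    "@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2",
    "@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1",
    "@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2",
]
print(check_interleaved(names))
```

An empty list means every odd-numbered read is followed by its mate.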
> > > > 
> > > > > 
> > > > > 
> > > > >        Do you get anything in LibraryStatistics.txt?
> > > > > 
> > > > > 
> > > > > The LibraryStatistics are
> > > > >   NumberOfPairedLibraries: 3
> > > > > 
> > > > > LibraryNumber: 0
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > >  File: /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz
> > > > >  NumberOfSequences: 248332323
> > > > > Distribution: 31/Library0.txt
> > > > > 
> > > > > LibraryNumber: 1
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > >  File: /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz
> > > > >  NumberOfSequences: 405911176
> > > > > Distribution: 31/Library1.txt
> > > > > 
> > > > > LibraryNumber: 2
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > >  File: /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz
> > > > >  NumberOfSequences: 234114234
> > > > > Distribution: 31/Library2.txt
> > > > > 
> > > > 
> > > >        Is there anything in 31/Library0.txt, 31/Library1.txt,
> > > >        31/Library2.txt?
> > > > 
> > > >        Can you provide the last 10 lines of
> > > >        SeedLengthDistribution.txt?
> > > > 
> > > > > 
> > > > > Best Regards,
> > > > > Huanle
> > > > > 
> > > > >        On Thu, 2012-03-01 at 17:06 -0500, LIU wrote:
> > > > > > Hi There,
> > > > > > 
> > > > > > I have been using Ray for de novo assembly.
> > > > > > 
> > > > > > The input reads are a mix of Illumina paired-end reads (these
> > > > > > account for 90%), Illumina single-end reads and 454 single-end
> > > > > > reads.
> > > > > > 
> > > > > > The command I used is
> > > > > > mpiexec -n 60 Ray \
> > > > > > -i /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \
> > > > > > -i /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \
> > > > > > -i /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \
> > > > > > -s /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz \
> > > > > > -s /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz \
> > > > > > -s /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz \
> > > > > > -s /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \
> > > > > > -s /home/s4196896/mix_assembly/input/radseq_v2.fasta \
> > > > > > -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta \
> > > > > > -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \
> > > > > > -o 71 \
> > > > > > -k 71
> > > > > > 
> > > > > > The output shows that scaffolds and contigs are the same (same
> > > > > > N50, total number of bases and number of sequences etc.).
> > > > > > 
> > > > > > This confused me.
> > > > > > 
> > > > > > 
> > > > > > I hope someone can help me out.
> > > > > > 
> > > > > > Thanks in advance.
> > > > > > 
> > > > > > Kind Regards,
> > > > > > --
> > > > > > Huanle
> > > > > > 
> > > > > > School of biological Sciences, UQ, QLD, AU

