Estimation of genome length is based on the k-mer coverage depth
frequencies.

But it is not very accurate when a genome has a lot of repeats.



On Mon, 2012-03-12 at 03:36 -0400, Huanle LIU wrote:
> Hi Seb,
> 
> I found that Ray is also able to estimate genome size based on kmer 
> distribution (? not sure about this).
> 
> Would you please explain it to me about how it works?
> 
> I found this estimation is much smaller than the genome should be.
> 
> I am assembling  a plant genome, although we subsample only the 
> non-repetitive regions, the size should not be as small as estimated by Ray, 
> which is less than 10 Mb.
> 
> Regards,
> Huanle
> On 08/03/2012, at 6:59 AM, Sébastien Boisvert wrote:
> 
> > Ray computes k-mer coverage depth.
> > 
> > Contigs contain only k-mers with high coverage depth.
> > 
> > So yes, the assembly is reliable with Ray without quality filtering.
> > 
> > But feel free to filter your reads and share your experience.
> > 
> > 
> > On Tue, 2012-03-06 at 17:04 -0500, LIU wrote:
> >> One more question.
> >> If there is no quality filtering, do you think the assembly is
> >> reliable?
> >> 
> >> 
> >> 
> >> On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert
> >> <sebastien.boisver...@ulaval.ca> wrote:
> >>        On Mon, 2012-03-05 at 22:11 -0500, LIU wrote:
> >>> Thanks very much for your explanation about the kmer myth.
> >>> 
> >>> 
> >>> The paired reads are not equal in length because i trimmed
> >>        the reads
> >>> based on quality. Specifically, i trimmed the reads using
> >>> filting criterial -- consecutive 15 bases having quality
> >>        score higher
> >>> than 15 (Phred Score). Some paired-end reads were also
> >>        broken.
> >>> 
> >> 
> >> 
> >>> 
> >>> I found only 31/Library1.txt
> >>> 293 1
> >>> 347 1
> >>> 359 1
> >>> 391 1
> >>> 
> >> 
> >> 
> >>        This shows that Ray sees no paired reads in you data.
> >> 
> >>> 
> >>> I do not know if i  have deleted the others if they were
> >>        produced.
> >>> 
> >>> 
> >>> 
> >>> 
> >>> The last 10 lines of 31/SeedLengthDistribution.txt are :
> >>> 
> >>> 
> >>> 16892 1
> >>> 17117 1
> >>> 18662 1
> >>> 19763 1
> >>> 21295 1
> >>> 23185 1
> >>> 23416 1
> >>> 25293 1
> >>> 26018 1
> >>> 28186 1
> >>> 
> >> 
> >> 
> >>        That is just fine. Ray uses these long DNA sequences present
> >>        in your
> >>        sample to estimate insert lengths for paired reads.
> >> 
> >>        However, it seems that Ray is unable to gather enough signal
> >>        for your
> >>        paired reads.
> >> 
> >>        Can you try without trimming your reads. I sense that maybe
> >>        the second
> >>        sequence is usually shorter than the k-mer length which
> >>        renders any
> >>        second read obsolete should it be shorter than the k-mer
> >>        length.
> >> 
> >>> 
> >>> Thanks.
> >>> 
> >>> 
> >>> Best Regards,
> >>> Huanle
> >>> 
> >>> 
> >>> On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert
> >>> <sebastien.boisver...@ulaval.ca> wrote:
> >>>        See my responses below.
> >>> 
> >>>        On Mon, 2012-03-05 at 18:53 -0500, LIU wrote:
> >>>> Hi ,
> >>>> 
> >>>> 
> >>>> Thanks for your response.
> >>>> 
> >>>> On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert
> >>>> <sebastien.boisver...@ulaval.ca> wrote:
> >>>>        1. Using a k-mer length of 71 will
> >>        _presumably_ not
> >>>        work very
> >>>>        well
> >>>>        because of sequencing errors. First do a
> >>        test run at
> >>>        k=31.
> >>>> Yes i also ran k=31.
> >>>> It is the same case as k=71.
> >>>> One more question about choice of kmer length.
> >>>> I was also told that longer kmer is supposed to
> >>        produce more
> >>>        accurate
> >>>> assembly, while shorter ones are more prone to
> >>        sequencing
> >>>        errors.
> >>>> I am confused. perhaps  i should open another
> >>        ticket to ask
> >>>        this
> >>>> question. But i really appreciate your answer.
> >>>> 
> >>> 
> >>> 
> >>>        Using longer k-mer makes the k-mers more unique.
> >>> 
> >>>        Let's say that this is a read:
> >>> 
> >>>                                        *
> >>> 
> >>        
> >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> >>> 
> >>> 
> >>>        The '*' marks a sequencing error.
> >>> 
> >>>        For 71-mers, the sliding window is:
> >>> 
> >>>                                        *
> >>> 
> >>        
> >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> >>> 
> >>> 
> >>        
> >> kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> >>> 
> >>>        So basically all the k-mers generated from that
> >>        sliding window
> >>>        contain
> >>>        the sequencing error.
> >>> 
> >>> 
> >>>        For 31-mers, the sliding window is:
> >>> 
> >>>                                        *
> >>> 
> >>        
> >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> >>> 
> >>>        kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> >>> 
> >>> 
> >>>        So with 31-mers, you will get some erroneous k-mers
> >>        and some
> >>>        genuine
> >>>        k-mers.
> >>> 
> >>> 
> >>>> 
> >>>> 
> >>>> 
> >>>>        2. Are your interleaved files properly
> >>        generated ?
> >>>> 
> >>>>        sequence1/1
> >>>>        sequence1/2
> >>>>        sequence2/1
> >>>>        sequence2/2
> >>>>        sequence3/1
> >>>>        sequence3/2
> >>>>        Yes, i think my sequences are correctlly
> >>>        interleaved. E.G.,
> >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1
> >>>> 
> >>> 
> >>        
> >> TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG
> >>>> 
> >>>> 
> >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2
> >>>> 
> >>> 
> >>        
> >> GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT
> >>>> 
> >>>> 
> >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1
> >>>> TATATAGATAGATTTCA
> >>>> 
> >>>> 
> >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2
> >>>> 
> >>> 
> >>        
> >> CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT
> >>>> In fact, i also tried velvet. It produced
> >>        different contigs
> >>>        and
> >>>> scaffolds. But of course Ray and Velvet may not be
> >>        directly
> >>>        compared
> >>>> because of different scaffolding strategy (i do
> >>        not know
> >>>        this, it's
> >>>> simply a guess).
> >>> 
> >>> 
> >>>        This look ok.
> >>> 
> >>>        BUt why is the second sequence shorter than the
> >>        first one ?
> >>> 
> >>>        Usually, Illumina sequencing produces 2 sequences of
> >>        the same
> >>>        length for
> >>>        each pair of sequences.
> >>> 
> >>>> 
> >>>> 
> >>>>        Do you get anything in
> >>        LibraryStatistics.txt ?
> >>>> 
> >>>> 
> >>>> The LibraryStatixtics are
> >>>>   NumberOfPairedLibraries: 3
> >>>> 
> >>>> 
> >>>> LibraryNumber: 0
> >>>> InputFormat: Interleaved,Paired
> >>>> DetectionType: Automatic
> >>>> 
> >>> 
> >>         File: 
> >> /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz
> >>>>  NumberOfSequences: 248332323
> >>>> Distribution: 31/Library0.txt
> >>>> 
> >>>> 
> >>>> LibraryNumber: 1
> >>>> InputFormat: Interleaved,Paired
> >>>> DetectionType: Automatic
> >>>> 
> >>> 
> >>         File: 
> >> /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz
> >>>>  NumberOfSequences: 405911176
> >>>> Distribution: 31/Library1.txt
> >>>> 
> >>>> 
> >>>> LibraryNumber: 2
> >>>> InputFormat: Interleaved,Paired
> >>>> DetectionType: Automatic
> >>>> 
> >>> 
> >>         File: 
> >> /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz
> >>>>  NumberOfSequences: 234114234
> >>>> Distribution: 31/Library2.txt
> >>>> 
> >>> 
> >>> 
> >>>        Is there anything in 31/Library0.txt,
> >>         31/Library1.txt,
> >>>         31/Library2.txt
> >>> 
> >>> 
> >>>        Can you provide the last 10 lines of
> >>>        SeedLengthDistribution.txt ?
> >>> 
> >>>> 
> >>>> Best Regards,
> >>>> Huanle
> >>>> 
> >>>>        On Thu, 2012-03-01 at 17:06 -0500, LIU
> >>        wrote:
> >>>>> Hi There,
> >>>>> 
> >>>>> I have been using Ray to de novo
> >>        assembly.
> >>>>> 
> >>>>> The input reads are a mix of illumina
> >>        pair-end
> >>>        reads (this
> >>>>        account for
> >>>>> 90%), illumina single-end reads and 454
> >>        single end
> >>>        reads.
> >>>>> 
> >>>>> The command i used is
> >>>>> mpiexec -n 60 Ray \
> >>>>> -i \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \
> >>>>> -i \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \
> >>>>> -i \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \
> >>>>> -s \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz
> >>>>        \
> >>>>> -s \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz
> >>>>        \
> >>>>> -s \
> >>>>> 
> >>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz
> >>>>        \
> >>>>> -s \
> >>>>> 
> >>> 
> >>         /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \
> >>>>> -s \
> >>>>> 
> >>         /home/s4196896/mix_assembly/input/radseq_v2.fasta
> >>>        \
> >>>>> -s \
> >>>>> 
> >>>> 
> >>> 
> >>         
> >> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta
> >>>>> \
> >>>>> -s \
> >>>>> 
> >>>> 
> >>> 
> >>         
> >> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \
> >>>>> -o \
> >>>>> 71 \
> >>>>> -k \
> >>>>> 71
> >>>>> 
> >>>>> The output shows that scaffolds and
> >>        contigs are
> >>>        the same
> >>>>        (same N50,
> >>>>> total number of bases and number of
> >>        sequences
> >>>        etc.).
> >>>>> 
> >>>>> This confused me.
> >>>>> 
> >>>>> 
> >>>>> I hope someone can help me out.
> >>>>> 
> >>>>> Thanks in advance.
> >>>>> 
> >>>>> Kind Regards,
> >>>>> --
> >>>>> Huanle
> >>>>> 
> >>>>> School of biological Sciences, UQ, QLD,
> >>        AU
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> --
> >>>> Huanle
> >>>> 
> >>>> School of biological Sciences, UQ, QLD, AU
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> --
> >>> Huanle
> >>> 
> >>> School of biological Sciences, UQ, QLD, AU
> >>> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> -- 
> >> Huanle 
> >> 
> >> School of biological Sciences, UQ, QLD, AU
> > 
> > 
> 



------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Reply via email to