Estimation of genome length is based on the k-mer coverage depth frequencies.
But it is not very accurate when a genome has a lot of repeats. On Mon, 2012-03-12 at 03:36 -0400, Huanle LIU wrote: > Hi Seb, > > I found that Ray is also able to estimate genome size based on kmer > distribution (? not sure about this). > > Would you please explain it to me about how it works? > > I found this estimation is much smaller than the genome should be. > > I am assembling a plant genome, although we subsample only the > non-repetitive regions, the size should not be as small as estimated by Ray, > which is less than 10 Mb. > > Regards, > Huanle > On 08/03/2012, at 6:59 AM, Sébastien Boisvert wrote: > > > Ray computes k-mer coverage depth. > > > > Contigs contain only k-mers with high coverage depth. > > > > So yes, the assembly is reliable with Ray without quality filtering. > > > > But feel free to filter your reads and share your experience. > > > > > > On Tue, 2012-03-06 at 17:04 -0500, LIU wrote: > >> One more question. > >> If there is no quality filtering, do you think the assembly is > >> reliable? > >> > >> > >> > >> On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert > >> <sebastien.boisver...@ulaval.ca> wrote: > >> On Mon, 2012-03-05 at 22:11 -0500, LIU wrote: > >>> Thanks very much for your explanation about the kmer myth. > >>> > >>> > >>> The paired reads are not equal in length because i trimmed > >> the reads > >>> based on quality. Specifically, i trimmed the reads using > >>> filting criterial -- consecutive 15 bases having quality > >> score higher > >>> than 15 (Phred Score). Some paired-end reads were also > >> broken. > >>> > >> > >> > >>> > >>> I found only 31/Library1.txt > >>> 293 1 > >>> 347 1 > >>> 359 1 > >>> 391 1 > >>> > >> > >> > >> This shows that Ray sees no paired reads in you data. > >> > >>> > >>> I do not know if i have deleted the others if they were > >> produced. > >>> > >>> > >>> > >>> > >>> The last 10 lines of 31/SeedLengthDistribution.txt are : > >>> > >>> > >>> 16892 1 > >>> 17117 1 > >>> 18662 1 > >>> 19763 1 > >>> 21295 1 > >>> 23185 1 > >>> 23416 1 > >>> 25293 1 > >>> 26018 1 > >>> 28186 1 > >>> > >> > >> > >> That is just fine. Ray uses these long DNA sequences present > >> in your > >> sample to estimate insert lengths for paired reads. > >> > >> However, it seems that Ray is unable to gather enough signal > >> for your > >> paired reads. > >> > >> Can you try without trimming your reads. I sense that maybe > >> the second > >> sequence is usually shorter than the k-mer length which > >> renders any > >> second read obsolete should it be shorter than the k-mer > >> length. > >> > >>> > >>> Thanks. > >>> > >>> > >>> Best Regards, > >>> Huanle > >>> > >>> > >>> On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert > >>> <sebastien.boisver...@ulaval.ca> wrote: > >>> See my responses below. > >>> > >>> On Mon, 2012-03-05 at 18:53 -0500, LIU wrote: > >>>> Hi , > >>>> > >>>> > >>>> Thanks for your response. > >>>> > >>>> On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert > >>>> <sebastien.boisver...@ulaval.ca> wrote: > >>>> 1. Using a k-mer length of 71 will > >> _presumably_ not > >>> work very > >>>> well > >>>> because of sequencing errors. First do a > >> test run at > >>> k=31. > >>>> Yes i also ran k=31. > >>>> It is the same case as k=71. > >>>> One more question about choice of kmer length. > >>>> I was also told that longer kmer is supposed to > >> produce more > >>> accurate > >>>> assembly, while shorter ones are more prone to > >> sequencing > >>> errors. > >>>> I am confused. perhaps i should open another > >> ticket to ask > >>> this > >>>> question. But i really appreciate your answer. > >>>> > >>> > >>> > >>> Using longer k-mer makes the k-mers more unique. > >>> > >>> Let's say that this is a read: > >>> > >>> * > >>> > >> > >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > >>> > >>> > >>> The '*' marks a sequencing error. > >>> > >>> For 71-mers, the sliding window is: > >>> > >>> * > >>> > >> > >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > >>> > >>> > >> > >> kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk > >>> > >>> So basically all the k-mers generated from that > >> sliding window > >>> contain > >>> the sequencing error. > >>> > >>> > >>> For 31-mers, the sliding window is: > >>> > >>> * > >>> > >> > >> TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > >>> > >>> kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk > >>> > >>> > >>> So with 31-mers, you will get some erroneous k-mers > >> and some > >>> genuine > >>> k-mers. > >>> > >>> > >>>> > >>>> > >>>> > >>>> 2. Are your interleaved files properly > >> generated ? > >>>> > >>>> sequence1/1 > >>>> sequence1/2 > >>>> sequence2/1 > >>>> sequence2/2 > >>>> sequence3/1 > >>>> sequence3/2 > >>>> Yes, i think my sequences are correctlly > >>> interleaved. E.G., > >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1 > >>>> > >>> > >> > >> TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG > >>>> > >>>> > >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2 > >>>> > >>> > >> > >> GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT > >>>> > >>>> > >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1 > >>>> TATATAGATAGATTTCA > >>>> > >>>> > >>>>> @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2 > >>>> > >>> > >> > >> CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT > >>>> In fact, i also tried velvet. It produced > >> different contigs > >>> and > >>>> scaffolds. But of course Ray and Velvet may not be > >> directly > >>> compared > >>>> because of different scaffolding strategy (i do > >> not know > >>> this, it's > >>>> simply a guess). > >>> > >>> > >>> This look ok. > >>> > >>> BUt why is the second sequence shorter than the > >> first one ? > >>> > >>> Usually, Illumina sequencing produces 2 sequences of > >> the same > >>> length for > >>> each pair of sequences. > >>> > >>>> > >>>> > >>>> Do you get anything in > >> LibraryStatistics.txt ? > >>>> > >>>> > >>>> The LibraryStatixtics are > >>>> NumberOfPairedLibraries: 3 > >>>> > >>>> > >>>> LibraryNumber: 0 > >>>> InputFormat: Interleaved,Paired > >>>> DetectionType: Automatic > >>>> > >>> > >> File: > >> /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz > >>>> NumberOfSequences: 248332323 > >>>> Distribution: 31/Library0.txt > >>>> > >>>> > >>>> LibraryNumber: 1 > >>>> InputFormat: Interleaved,Paired > >>>> DetectionType: Automatic > >>>> > >>> > >> File: > >> /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz > >>>> NumberOfSequences: 405911176 > >>>> Distribution: 31/Library1.txt > >>>> > >>>> > >>>> LibraryNumber: 2 > >>>> InputFormat: Interleaved,Paired > >>>> DetectionType: Automatic > >>>> > >>> > >> File: > >> /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz > >>>> NumberOfSequences: 234114234 > >>>> Distribution: 31/Library2.txt > >>>> > >>> > >>> > >>> Is there anything in 31/Library0.txt, > >> 31/Library1.txt, > >>> 31/Library2.txt > >>> > >>> > >>> Can you provide the last 10 lines of > >>> SeedLengthDistribution.txt ? > >>> > >>>> > >>>> Best Regards, > >>>> Huanle > >>>> > >>>> On Thu, 2012-03-01 at 17:06 -0500, LIU > >> wrote: > >>>>> Hi There, > >>>>> > >>>>> I have been using Ray to de novo > >> assembly. > >>>>> > >>>>> The input reads are a mix of illumina > >> pair-end > >>> reads (this > >>>> account for > >>>>> 90%), illumina single-end reads and 454 > >> single end > >>> reads. > >>>>> > >>>>> The command i used is > >>>>> mpiexec -n 60 Ray \ > >>>>> -i \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \ > >>>>> -i \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \ > >>>>> -i \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \ > >>>>> -s \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz > >>>> \ > >>>>> -s \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz > >>>> \ > >>>>> -s \ > >>>>> > >>>> > >>> > >> /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz > >>>> \ > >>>>> -s \ > >>>>> > >>> > >> /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \ > >>>>> -s \ > >>>>> > >> /home/s4196896/mix_assembly/input/radseq_v2.fasta > >>> \ > >>>>> -s \ > >>>>> > >>>> > >>> > >> > >> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta > >>>>> \ > >>>>> -s \ > >>>>> > >>>> > >>> > >> > >> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \ > >>>>> -o \ > >>>>> 71 \ > >>>>> -k \ > >>>>> 71 > >>>>> > >>>>> The output shows that scaffolds and > >> contigs are > >>> the same > >>>> (same N50, > >>>>> total number of bases and number of > >> sequences > >>> etc.). > >>>>> > >>>>> This confused me. > >>>>> > >>>>> > >>>>> I hope someone can help me out. > >>>>> > >>>>> Thanks in advance. > >>>>> > >>>>> Kind Regards, > >>>>> -- > >>>>> Huanle > >>>>> > >>>>> School of biological Sciences, UQ, QLD, > >> AU > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> -- > >>>> Huanle > >>>> > >>>> School of biological Sciences, UQ, QLD, AU > >>>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> -- > >>> Huanle > >>> > >>> School of biological Sciences, UQ, QLD, AU > >>> > >> > >> > >> > >> > >> > >> > >> -- > >> Huanle > >> > >> School of biological Sciences, UQ, QLD, AU > > > > > ------------------------------------------------------------------------------ Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d _______________________________________________ Denovoassembler-users mailing list Denovoassembler-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/denovoassembler-users