> As you mentioned that 'the assembly is reliable with Ray without
> quality filtering.' I am wondering if it applied to other assemblers.
> It would be very nice if you can let me know more about this.
Ray will remove errors automatically. A k-mer has to pass several steps
in order to be included in the de Bruijn graph in Ray.

First, it needs to be in the Bloom filter. Fewer than 1% of k-mers will
act as if they are in the Bloom filter when they are not (false
positives).

Second, after being in the Bloom filter, a k-mer will learn the ways of
the k-mer at the k-mer academy (KmerAcademy.cpp).

Third, after being at the k-mer academy, a k-mer will be part of the
graph.

For other assemblers, I don't know.

On Sun, 2012-03-11 at 06:51 -0400, Huanle LIU wrote:
> Hi Sébastien,
>
> I tried not filtering. These are the basic metrics for the assembly.
> This time it made a little bit of difference between scaffolds and
> contigs.
>
> Metric                        Contigs      Scaffolds
> Max contig/scaffold length    37121        37121
> Mean length                   175.02       175.28
> Sd                            144.46       171.21
> Median                        135          135
> N50                           166          166
> Num of contigs/scaffolds      3183628      3181307
> Number of contigs >=1kb       15186        14552
> Number of contigs in N50      944002       940389
> Number of bases               557197106    557626009
> Number of bases in >=1kb      22034806     23678812
>
> And this time it produced Library stats for all three libraries.
>
> As you mentioned that 'the assembly is reliable with Ray without
> quality filtering.' I am wondering if it applied to other assemblers.
> It would be very nice if you can let me know more about this.
>
> Best Regards,
> Huanle
>
> On 08/03/2012, at 6:59 AM, Sébastien Boisvert wrote:
>
> > Ray computes k-mer coverage depth.
> >
> > Contigs contain only k-mers with high coverage depth.
> >
> > So yes, the assembly is reliable with Ray without quality filtering.
> >
> > But feel free to filter your reads and share your experience.
> >
> > On Tue, 2012-03-06 at 17:04 -0500, LIU wrote:
> > > One more question.
> > > If there is no quality filtering, do you think the assembly is
> > > reliable?
> > >
> > > On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert
> > > <sebastien.boisver...@ulaval.ca> wrote:
> > > On Mon, 2012-03-05 at 22:11 -0500, LIU wrote:
> > > > Thanks very much for your explanation about the kmer myth.
> > > >
> > > > The paired reads are not equal in length because I trimmed the
> > > > reads based on quality. Specifically, I trimmed the reads using
> > > > the filtering criterion of 15 consecutive bases having a quality
> > > > score higher than 15 (Phred score). Some paired-end reads were
> > > > also broken.
> > > >
> > > > I found only 31/Library1.txt
> > > > 293 1
> > > > 347 1
> > > > 359 1
> > > > 391 1
> > >
> > > This shows that Ray sees no paired reads in your data.
> > >
> > > > I do not know if I have deleted the others if they were
> > > > produced.
> > > >
> > > > The last 10 lines of 31/SeedLengthDistribution.txt are:
> > > >
> > > > 16892 1
> > > > 17117 1
> > > > 18662 1
> > > > 19763 1
> > > > 21295 1
> > > > 23185 1
> > > > 23416 1
> > > > 25293 1
> > > > 26018 1
> > > > 28186 1
> > >
> > > That is just fine. Ray uses these long DNA sequences present in
> > > your sample to estimate insert lengths for paired reads.
> > >
> > > However, it seems that Ray is unable to gather enough signal for
> > > your paired reads.
> > >
> > > Can you try without trimming your reads? I sense that maybe the
> > > second sequence is usually shorter than the k-mer length, which
> > > renders any second read obsolete should it be shorter than the
> > > k-mer length.
> > >
> > > > Thanks.
> > > >
> > > > Best Regards,
> > > > Huanle
> > > >
> > > > On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert
> > > > <sebastien.boisver...@ulaval.ca> wrote:
> > > > See my responses below.
> > > >
> > > > On Mon, 2012-03-05 at 18:53 -0500, LIU wrote:
> > > > > Hi,
> > > > >
> > > > > Thanks for your response.
> > > > >
> > > > > On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert
> > > > > <sebastien.boisver...@ulaval.ca> wrote:
> > > > > 1. Using a k-mer length of 71 will _presumably_ not work very
> > > > > well because of sequencing errors. First do a test run at
> > > > > k=31.
> > > > >
> > > > > Yes, I also ran k=31.
> > > > > It is the same case as k=71.
> > > > > One more question about the choice of k-mer length.
> > > > > I was also told that longer k-mers are supposed to produce a
> > > > > more accurate assembly, while shorter ones are more prone to
> > > > > sequencing errors.
> > > > > I am confused. Perhaps I should open another ticket to ask
> > > > > this question. But I really appreciate your answer.
> > > >
> > > > Using longer k-mers makes the k-mers more unique.
> > > >
> > > > Let's say that this is a read:
> > > >
> > > >                                      *
> > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > >
> > > > The '*' marks a sequencing error.
> > > >
> > > > For 71-mers, the sliding window is:
> > > >
> > > >                                      *
> > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > >   kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> > > >
> > > > So basically all the k-mers generated from that sliding window
> > > > contain the sequencing error.
> > > >
> > > > For 31-mers, the sliding window is:
> > > >
> > > >                                      *
> > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
> > > >                       kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
> > > >
> > > > So with 31-mers, you will get some erroneous k-mers and some
> > > > genuine k-mers.
> > > >
> > > > > 2. Are your interleaved files properly generated?
> > > > >
> > > > > sequence1/1
> > > > > sequence1/2
> > > > > sequence2/1
> > > > > sequence2/2
> > > > > sequence3/1
> > > > > sequence3/2
> > > > >
> > > > > Yes, I think my sequences are correctly interleaved. E.g.,
> > > > >
> > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1
> > > > > TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG
> > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2
> > > > > GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT
> > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1
> > > > > TATATAGATAGATTTCA
> > > > > @AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2
> > > > > CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT
> > > > >
> > > > > In fact, I also tried Velvet. It produced different contigs and
> > > > > scaffolds. But of course Ray and Velvet may not be directly
> > > > > comparable because of their different scaffolding strategies
> > > > > (I do not know this, it's simply a guess).
> > > >
> > > > This looks OK.
> > > >
> > > > But why is the second sequence shorter than the first one?
> > > >
> > > > Usually, Illumina sequencing produces 2 sequences of the same
> > > > length for each pair of sequences.
> > > >
> > > > > Do you get anything in LibraryStatistics.txt?
> > > > >
> > > > > The LibraryStatistics are
> > > > > NumberOfPairedLibraries: 3
> > > > >
> > > > > LibraryNumber: 0
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz
> > > > > NumberOfSequences: 248332323
> > > > > Distribution: 31/Library0.txt
> > > > >
> > > > > LibraryNumber: 1
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz
> > > > > NumberOfSequences: 405911176
> > > > > Distribution: 31/Library1.txt
> > > > >
> > > > > LibraryNumber: 2
> > > > > InputFormat: Interleaved,Paired
> > > > > DetectionType: Automatic
> > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz
> > > > > NumberOfSequences: 234114234
> > > > > Distribution: 31/Library2.txt
> > > >
> > > > Is there anything in 31/Library0.txt, 31/Library1.txt,
> > > > 31/Library2.txt?
> > > >
> > > > Can you provide the last 10 lines of SeedLengthDistribution.txt?
> > > >
> > > > > Best Regards,
> > > > > Huanle
> > > > >
> > > > > On Thu, 2012-03-01 at 17:06 -0500, LIU wrote:
> > > > > > Hi There,
> > > > > >
> > > > > > I have been using Ray for de novo assembly.
> > > > > >
> > > > > > The input reads are a mix of Illumina paired-end reads (these
> > > > > > account for 90%), Illumina single-end reads and 454
> > > > > > single-end reads.
> > > > > >
> > > > > > The command I used is
> > > > > > mpiexec -n 60 Ray \
> > > > > >  -i /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \
> > > > > >  -i /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \
> > > > > >  -i /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \
> > > > > >  -s /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz \
> > > > > >  -s /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz \
> > > > > >  -s /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz \
> > > > > >  -s /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \
> > > > > >  -s /home/s4196896/mix_assembly/input/radseq_v2.fasta \
> > > > > >  -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta \
> > > > > >  -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \
> > > > > >  -o 71 \
> > > > > >  -k 71
> > > > > >
> > > > > > The output shows that scaffolds and contigs are the same
> > > > > > (same N50, total number of bases and number of sequences
> > > > > > etc.).
> > > > > >
> > > > > > This confused me.
> > > > > >
> > > > > > I hope someone can help me out.
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Kind Regards,
> > > > > > --
> > > > > > Huanle
> > > > > >
> > > > > > School of biological Sciences, UQ, QLD, AU

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
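
A small Python sketch of the sliding-window argument made earlier in the thread. It counts, for a given k, how many k-mers of a read overlap a single erroneous base. The function name and the error position (index 37, 0-based) are assumptions chosen for illustration; the thread does not state which base is wrong. This is not Ray code, only a way to check the reasoning.

```python
def count_kmers_hit_by_error(read, k, error_index):
    """Return (erroneous, error_free) k-mer counts for one bad base.

    A k-mer starting at position s covers positions s .. s + k - 1,
    so it is erroneous when s <= error_index < s + k.
    """
    starts = range(len(read) - k + 1)  # empty when the read is shorter than k
    erroneous = sum(1 for s in starts if s <= error_index < s + k)
    return erroneous, len(starts) - erroneous


# Read taken from the thread; the error position is an assumption.
read = ("TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTG"
        "CCCATCCTTCTTCGTCCTCTTCCCG")
error_index = 37  # hypothetical position of the sequencing error

for k in (71, 31):
    bad, good = count_kmers_hit_by_error(read, k, error_index)
    print(f"k={k}: {bad} k-mers contain the error, {good} are error-free")
```

With this read and the assumed position, every 71-mer overlaps the error, while the 31-mers split into erroneous and error-free ones. A read shorter than k yields no k-mers at all, which is the concern raised about trimmed second reads.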