Ray computes k-mer coverage depth. Contigs contain only k-mers with high coverage depth.
So yes, the assembly is reliable with Ray without quality filtering. But feel free to filter your reads and share your experience. On Tue, 2012-03-06 at 17:04 -0500, LIU wrote: > One more question. > If there is no quality filtering, do you think the assembly is > reliable? > > > > On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert > <[email protected]> wrote: > On Mon, 2012-03-05 at 22:11 -0500, LIU wrote: > > Thanks very much for your explanation about the kmer myth. > > > > > > The paired reads are not equal in length because i trimmed > the reads > > based on quality. Specifically, i trimmed the reads using > > filting criterial -- consecutive 15 bases having quality > score higher > > than 15 (Phred Score). Some paired-end reads were also > broken. > > > > > > > > I found only 31/Library1.txt > > 293 1 > > 347 1 > > 359 1 > > 391 1 > > > > > This shows that Ray sees no paired reads in you data. > > > > > I do not know if i have deleted the others if they were > produced. > > > > > > > > > > The last 10 lines of 31/SeedLengthDistribution.txt are : > > > > > > 16892 1 > > 17117 1 > > 18662 1 > > 19763 1 > > 21295 1 > > 23185 1 > > 23416 1 > > 25293 1 > > 26018 1 > > 28186 1 > > > > > That is just fine. Ray uses these long DNA sequences present > in your > sample to estimate insert lengths for paired reads. > > However, it seems that Ray is unable to gather enough signal > for your > paired reads. > > Can you try without trimming your reads. I sense that maybe > the second > sequence is usually shorter than the k-mer length which > renders any > second read obsolete should it be shorter than the k-mer > length. > > > > > Thanks. > > > > > > Best Regards, > > Huanle > > > > > > On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert > > <[email protected]> wrote: > > See my responses below. > > > > On Mon, 2012-03-05 at 18:53 -0500, LIU wrote: > > > Hi , > > > > > > > > > Thanks for your response. > > > > > > On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert > > > <[email protected]> wrote: > > > 1. Using a k-mer length of 71 will > _presumably_ not > > work very > > > well > > > because of sequencing errors. First do a > test run at > > k=31. > > > Yes i also ran k=31. > > > It is the same case as k=71. > > > One more question about choice of kmer length. > > > I was also told that longer kmer is supposed to > produce more > > accurate > > > assembly, while shorter ones are more prone to > sequencing > > errors. > > > I am confused. perhaps i should open another > ticket to ask > > this > > > question. But i really appreciate your answer. > > > > > > > > > Using longer k-mer makes the k-mers more unique. > > > > Let's say that this is a read: > > > > * > > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > > > > > > The '*' marks a sequencing error. > > > > For 71-mers, the sliding window is: > > > > * > > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > > > > > > kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk > > > > So basically all the k-mers generated from that > sliding window > > contain > > the sequencing error. > > > > > > For 31-mers, the sliding window is: > > > > * > > > > TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG > > > > kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk > > > > > > So with 31-mers, you will get some erroneous k-mers > and some > > genuine > > k-mers. > > > > > > > > > > > > > > > > 2. Are your interleaved files properly > generated ? > > > > > > sequence1/1 > > > sequence1/2 > > > sequence2/1 > > > sequence2/2 > > > sequence3/1 > > > sequence3/2 > > > Yes, i think my sequences are correctlly > > interleaved. E.G., > > > >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1 > > > > > > > TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG > > > > > > > > > >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2 > > > > > > > GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT > > > > > > > > > >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1 > > > TATATAGATAGATTTCA > > > > > > > > > >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2 > > > > > > CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT > > > In fact, i also tried velvet. It produced > different contigs > > and > > > scaffolds. But of course Ray and Velvet may not be > directly > > compared > > > because of different scaffolding strategy (i do > not know > > this, it's > > > simply a guess). > > > > > > This look ok. > > > > BUt why is the second sequence shorter than the > first one ? > > > > Usually, Illumina sequencing produces 2 sequences of > the same > > length for > > each pair of sequences. > > > > > > > > > > > Do you get anything in > LibraryStatistics.txt ? > > > > > > > > > The LibraryStatixtics are > > > NumberOfPairedLibraries: 3 > > > > > > > > > LibraryNumber: 0 > > > InputFormat: Interleaved,Paired > > > DetectionType: Automatic > > > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz > > > NumberOfSequences: 248332323 > > > Distribution: 31/Library0.txt > > > > > > > > > LibraryNumber: 1 > > > InputFormat: Interleaved,Paired > > > DetectionType: Automatic > > > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz > > > NumberOfSequences: 405911176 > > > Distribution: 31/Library1.txt > > > > > > > > > LibraryNumber: 2 > > > InputFormat: Interleaved,Paired > > > DetectionType: Automatic > > > > > > File: /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz > > > NumberOfSequences: 234114234 > > > Distribution: 31/Library2.txt > > > > > > > > > Is there anything in 31/Library0.txt, > 31/Library1.txt, > > 31/Library2.txt > > > > > > Can you provide the last 10 lines of > > SeedLengthDistribution.txt ? > > > > > > > > Best Regards, > > > Huanle > > > > > > On Thu, 2012-03-01 at 17:06 -0500, LIU > wrote: > > > > Hi There, > > > > > > > > I have been using Ray to de novo > assembly. > > > > > > > > The input reads are a mix of illumina > pair-end > > reads (this > > > account for > > > > 90%), illumina single-end reads and 454 > single end > > reads. > > > > > > > > The command i used is > > > > mpiexec -n 60 Ray \ > > > > -i \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \ > > > > -i \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \ > > > > -i \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \ > > > > -s \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz > > > \ > > > > -s \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz > > > \ > > > > -s \ > > > > > > > > > > /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz > > > \ > > > > -s \ > > > > > > > /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \ > > > > -s \ > > > > > /home/s4196896/mix_assembly/input/radseq_v2.fasta > > \ > > > > -s \ > > > > > > > > > > /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta > > > > \ > > > > -s \ > > > > > > > > > > /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \ > > > > -o \ > > > > 71 \ > > > > -k \ > > > > 71 > > > > > > > > The output shows that scaffolds and > contigs are > > the same > > > (same N50, > > > > total number of bases and number of > sequences > > etc.). > > > > > > > > This confused me. > > > > > > > > > > > > I hope someone can help me out. > > > > > > > > Thanks in advance. > > > > > > > > Kind Regards, > > > > -- > > > > Huanle > > > > > > > > School of biological Sciences, UQ, QLD, > AU > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Huanle > > > > > > School of biological Sciences, UQ, QLD, AU > > > > > > > > > > > > > > > > > > > -- > > Huanle > > > > School of biological Sciences, UQ, QLD, AU > > > > > > > > > -- > Huanle > > School of biological Sciences, UQ, QLD, AU ------------------------------------------------------------------------------ Virtualization & Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ _______________________________________________ Denovoassembler-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
