Re: [Denovoassembler-users] Ray produced the same scaffolds and contigs

Sébastien Boisvert Wed, 07 Mar 2012 12:56:34 -0800

Ray computes k-mer coverage depth.

Contigs contain only k-mers with high coverage depth.


So yes, the assembly is reliable with Ray without quality filtering.

But feel free to filter your reads and share your experience.
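
As a rough illustration of the coverage-depth idea above (a minimal sketch, not Ray's actual implementation), the Python snippet below counts how often each k-mer occurs across the reads and keeps only those seen at least `min_depth` times. The names `kmers`, `solid_kmers` and `min_depth`, the toy reads, and the threshold value are all assumptions made for this example. Reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (nothing if seq is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def solid_kmers(reads, k, min_depth):
    """Return the set of k-mers observed at least min_depth times."""
    depth = Counter()
    for read in reads:
        depth.update(kmers(read, k))
    return {kmer for kmer, count in depth.items() if count >= min_depth}


# Toy data: the third read ends with a simulated sequencing error (T -> A).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
solid = solid_kmers(reads, k=5, min_depth=2)
```

In this toy input the k-mer `TACGA` is seen only once, so it falls below the assumed threshold and is excluded, while the k-mers shared by several reads are kept.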


On Tue, 2012-03-06 at 17:04 -0500, LIU wrote:
> One more question.
> If there is no quality filtering, do you think the assembly is
> reliable?
> 
> 
> 
> On Wed, Mar 7, 2012 at 12:54 AM, Sébastien Boisvert
> <[email protected]> wrote:
>         On Mon, 2012-03-05 at 22:11 -0500, LIU wrote:
>         > Thanks very much for your explanation about the kmer myth.
>         >
>         >
>         > The paired reads are not equal in length because I trimmed the reads
>         > based on quality. Specifically, I trimmed the reads using the
>         > filtering criterion of 15 consecutive bases with a Phred quality
>         > score higher than 15. Some paired-end reads were also broken.
>         >
>         
>         
>         >
>         > I found only 31/Library1.txt
>         > 293 1
>         > 347 1
>         > 359 1
>         > 391 1
>         >
>         
>         
>         This shows that Ray sees no paired reads in your data.
>         
>         >
>         > I do not know whether I deleted the other files, if they were produced.
>         >
>         >
>         >
>         >
>         > The last 10 lines of 31/SeedLengthDistribution.txt are :
>         >
>         >
>         > 16892 1
>         > 17117 1
>         > 18662 1
>         > 19763 1
>         > 21295 1
>         > 23185 1
>         > 23416 1
>         > 25293 1
>         > 26018 1
>         > 28186 1
>         >
>         
>         
>         That is just fine. Ray uses these long DNA sequences present in your
>         sample to estimate insert lengths for paired reads.
>         
>         However, it seems that Ray is unable to gather enough signal for your
>         paired reads.
>         
>         Can you try without trimming your reads? I suspect that the second
>         sequence is often shorter than the k-mer length, and a read shorter
>         than the k-mer length yields no k-mers and so cannot be used.
>         
>         >
>         > Thanks.
>         >
>         >
>         > Best Regards,
>         > Huanle
>         >
>         >
>         > On Tue, Mar 6, 2012 at 12:34 PM, Sébastien Boisvert
>         > <[email protected]> wrote:
>         >         See my responses below.
>         >
>         >         On Mon, 2012-03-05 at 18:53 -0500, LIU wrote:
>         >         > Hi ,
>         >         >
>         >         >
>         >         > Thanks for your response.
>         >         >
>         >         > On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert
>         >         > <[email protected]> wrote:
>         >         >         1. Using a k-mer length of 71 will _presumably_ not work very well
>         >         >         because of sequencing errors. First do a test run at k=31.
>         >         > Yes, I also ran k=31.
>         >         > The result is the same as with k=71.
>         >         > One more question about the choice of k-mer length.
>         >         > I was also told that a longer k-mer is supposed to produce a more accurate
>         >         > assembly, while shorter ones are more prone to sequencing errors.
>         >         > I am confused. Perhaps I should open another ticket to ask this
>         >         > question, but I would really appreciate your answer.
>         >         >
>         >
>         >
>         >         Using a longer k-mer length makes each k-mer more likely to be unique.
>         >
>         >         Let's say that this is a read:
>         >
>         >                                         *
>         >         TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
>         >
>         >
>         >         The '*' marks a sequencing error.
>         >
>         >         For 71-mers, the sliding window is:
>         >
>         >                                         *
>         >         TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
>         >
>         >
>         
> kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
>         >
>         >         So basically all the k-mers generated from that sliding window contain
>         >         the sequencing error.
>         >
>         >
>         >         For 31-mers, the sliding window is:
>         >
>         >                                         *
>         >         TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
>         >
>         >         kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
>         >
>         >
>         >         So with 31-mers, you will get some erroneous k-mers and some genuine
>         >         k-mers.
>         >
>         >
>         >         >
>         >         >
>         >         >
>         >         >         2. Are your interleaved files properly generated?
>         >         >
>         >         >         sequence1/1
>         >         >         sequence1/2
>         >         >         sequence2/1
>         >         >         sequence2/2
>         >         >         sequence3/1
>         >         >         sequence3/2
>         >         > Yes, I think my sequences are correctly interleaved. E.g.,
>         >         > >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1
>         >         > TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG
>         >         >
>         >         > >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2
>         >         > GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT
>         >         >
>         >         > >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1
>         >         > TATATAGATAGATTTCA
>         >         >
>         >         > >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2
>         >         > CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT
>         >         >
>         >         > In fact, I also tried Velvet. It produced different contigs and
>         >         > scaffolds. But of course Ray and Velvet may not be directly comparable
>         >         > because of their different scaffolding strategies (I do not know this;
>         >         > it is simply a guess).
>         >
>         >
>         >         This looks OK.
>         >
>         >         But why is the second sequence shorter than the first one?
>         >
>         >         Usually, Illumina sequencing produces 2 sequences of the same length for
>         >         each pair of sequences.
>         >
>         >         >
>         >         >
>         >         >         Do you get anything in LibraryStatistics.txt?
>         >         >
>         >         >
>         >         > The LibraryStatistics are
>         >         >    NumberOfPairedLibraries: 3
>         >         >
>         >         >
>         >         > LibraryNumber: 0
>         >         >  InputFormat: Interleaved,Paired
>         >         >  DetectionType: Automatic
>         >         >  File: /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz
>         >         >   NumberOfSequences: 248332323
>         >         >  Distribution: 31/Library0.txt
>         >         >
>         >         >
>         >         > LibraryNumber: 1
>         >         >  InputFormat: Interleaved,Paired
>         >         >  DetectionType: Automatic
>         >         >  File: /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz
>         >         >   NumberOfSequences: 405911176
>         >         >  Distribution: 31/Library1.txt
>         >         >
>         >         >
>         >         > LibraryNumber: 2
>         >         >  InputFormat: Interleaved,Paired
>         >         >  DetectionType: Automatic
>         >         >  File: /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz
>         >         >   NumberOfSequences: 234114234
>         >         >  Distribution: 31/Library2.txt
>         >         >
>         >
>         >
>         >         Is there anything in 31/Library0.txt, 31/Library1.txt, and
>         >         31/Library2.txt?
>         >
>         >
>         >         Can you provide the last 10 lines of SeedLengthDistribution.txt?
>         >
>         >         >
>         >         > Best Regards,
>         >         > Huanle
>         >         >
>         >         >         On Thu, 2012-03-01 at 17:06 -0500, LIU wrote:
>         >         >         > Hi There,
>         >         >         >
>         >         >         > I have been using Ray for de novo assembly.
>         >         >         >
>         >         >         > The input reads are a mix of Illumina paired-end reads (these account for
>         >         >         > 90%), Illumina single-end reads, and 454 single-end reads.
>         >         >         >
>         >         >         > The command I used is
>         >         >         > mpiexec -n 60 Ray \
>         >         >         >  -i /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \
>         >         >         >  -i /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \
>         >         >         >  -i /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \
>         >         >         >  -s /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz \
>         >         >         >  -s /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz \
>         >         >         >  -s /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz \
>         >         >         >  -s /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \
>         >         >         >  -s /home/s4196896/mix_assembly/input/radseq_v2.fasta \
>         >         >         >  -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta \
>         >         >         >  -s /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \
>         >         >         >  -o 71 \
>         >         >         >  -k 71
>         >         >         >
>         >         >         > The output shows that scaffolds and contigs are the same (same N50,
>         >         >         > total number of bases, number of sequences, etc.).
>         >         >         >
>         >         >         > This confused me.
>         >         >         >
>         >         >         >
>         >         >         > I hope someone can help me out.
>         >         >         >
>         >         >         > Thanks in advance.
>         >         >         >
>         >         >         > Kind Regards,
>         >         >         > --
>         >         >         > Huanle
>         >         >         >
>         >         >         > School of biological Sciences, UQ, QLD, AU
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         > --
>         >         > Huanle
>         >         >
>         >         > School of biological Sciences, UQ, QLD, AU
>         >         >
>         >
>         >
>         >
>         >
>         >
>         >
>         >
>         > --
>         > Huanle
>         >
>         > School of biological Sciences, UQ, QLD, AU
>         >
>         
>         
>         
> 
> 
> 
> -- 
> Huanle 
> 
> School of biological Sciences, UQ, QLD, AU
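
The k-mer length explanation quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kmer_error_split` is made up for this example, the read is taken to be 75 bases long like the one in the thread, and the 0-based error position 32 is an assumed value chosen for the example.

```python
def kmer_error_split(read_length, k, error_pos):
    """Count k-mers that overlap a single error versus those that do not.

    error_pos is the 0-based position of the sequencing error in the read.
    Returns a tuple (erroneous, genuine).
    """
    # A read shorter than k yields no k-mers at all.
    total = max(read_length - k + 1, 0)
    # The window starting at i covers positions i .. i + k - 1, so it
    # contains the error when error_pos - k + 1 <= i <= error_pos.
    first = max(0, error_pos - k + 1)
    last = min(error_pos, read_length - k)
    erroneous = max(last - first + 1, 0)
    return erroneous, total - erroneous


READ_LENGTH = 75   # length of the example read in the thread
ERROR_POS = 32     # assumed 0-based error position, for illustration only

long_k = kmer_error_split(READ_LENGTH, 71, ERROR_POS)
short_k = kmer_error_split(READ_LENGTH, 31, ERROR_POS)
trimmed = kmer_error_split(17, 31, 5)
```

Under these assumptions, k=71 gives 5 k-mers and all of them contain the error, while k=31 gives 45 k-mers of which 31 contain the error and 14 are genuine. The last call mirrors a 17-base trimmed read such as the one shown in the thread: at k=31 it produces no k-mers, which is why a heavily trimmed mate cannot contribute to pairing.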


