Re: [Denovoassembler-users] Duplicate sequence in contigs

Sébastien Boisvert Fri, 18 May 2012 08:18:22 -0700

See my answers below.

On 2012-05-17 22:32, Mitchell Stanton-Cook wrote:

Hi Seb,
Thanks for the reply.


A bit more background:
-------------------------------
This is 100 bp Illumina PE data (insert of ~300 bp, s.d. of ~10%).
~1000X coverage. Ray (1.7) did not like this high coverage (you have previously commented on the user list about this). We sampled this down to ~100X coverage.

The Ray 2.x.y series supports much higher coverage, so I think you can provide your 1000X coverage data.

We also cleaned the reads (you have pointed out that this is not necessary, but having a consistent cleaned set makes downstream analysis, i.e. SNP calling, much easier).


Yes.

All input bases have a Q score >= 30. Reads > 70 bp after trimming were filtered. After this the mean read size is ~97 bp. We only consider read pairs (if one of the sequences in a pair fails a cleaning criterion, both do) and hence no single-end reads go into Ray.

OK.

Ray was executed like this (we used Ray's internal estimations/calculations to determine the best parameters):
mpiexec -n $PROC Ray -i XXXX.fastq -k $K -o $OUT

We looked at kmers from 15-35 in increments of 2.
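A minimal sketch of how such a sweep could be scripted. It only builds the command lines, using the mpiexec/Ray options that appear in the command above (-n, -i, -k, -o). The function name, the process count of 8 and the per-k output directory naming are assumptions made for illustration, not something taken from the original run.

```python
import shlex


def ray_sweep_commands(reads, procs, out_prefix, k_min=15, k_max=35, step=2):
    """Build one mpiexec/Ray command line per k-mer length.

    Only the options shown in the thread (-n, -i, -k, -o) are used.
    The output naming scheme "<out_prefix>_k<k>" is hypothetical.
    """
    commands = []
    for k in range(k_min, k_max + 1, step):
        args = [
            "mpiexec", "-n", str(procs), "Ray",
            "-i", reads,
            "-k", str(k),
            "-o", f"{out_prefix}_k{k}",
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in ray_sweep_commands("XXXX.fastq", 8, "ray_out"):
        print(cmd)
```

Printing rather than executing keeps the sketch safe to run anywhere; the lines can be pasted into a job script once they look right.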
The genome is ~1.8 Mb. From the assemblies it appears there are not a lot of repetitive elements in the genome. I have attached a csv of the results (XXXX_ALL.csv) (abbreviations: c = contigs, s = scaffold, N = scaffolding character).
Results for kmer 21 and kmer 23 are interesting.
kmer 21) As previously mentioned we have a 63 Kb duplication. For 63 Kb the starts of the two contigs are almost identical (3 mismatches and 1 gap):

If Ray has not put these together, chances are that it might be a duplication in your genome.

  Score = 1.165e+05 bits (63086),  Expect = 0.0

  Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
  Strand=Plus/Plus
I have attached the alignment (XXXX_b2s.aln)

There is also a smaller 5 Kb duplication detected:

Score = 9583 bits (5189),  Expect = 0.0

  Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)
  Strand=Plus/Minus
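
The integer percentages in these reports hide how close the two alignments really are. As a rough sketch, the "Identities" line can be parsed to recover the exact identity and the number of non-identical columns. The function name and the regular expression are assumptions written for the line format shown above, not part of BLAST itself.

```python
import re

# Matches lines such as:
#   Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
IDENTITY_RE = re.compile(
    r"Identities\s*=\s*(\d+)/(\d+).*?Gaps\s*=\s*(\d+)/(\d+)"
)


def summarize_identities(line):
    """Return exact percent identity and column counts for one report line."""
    match = IDENTITY_RE.search(line)
    if match is None:
        raise ValueError("not an Identities/Gaps line: %r" % line)
    identical, aligned, gaps, _ = map(int, match.groups())
    return {
        "percent_identity": 100.0 * identical / aligned,
        "non_identical_columns": aligned - identical,  # mismatches plus gaps
        "gap_columns": gaps,
    }


if __name__ == "__main__":
    for report in (
        "Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)",
        "Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)",
    ):
        print(summarize_identities(report))
```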

Again, Ray is quite conservative; it does not take any risk of producing a misassembly.

So this 5 kb region, in the de Bruijn graph, probably had >= 2 sources and >= sinks.
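
To illustrate what several entries and exits around a shared region look like, here is a toy sketch (not Ray's implementation). It builds a simple k-mer graph from two made-up sequences that share the stretch GATTACA and lists the k-mers with two or more predecessors or successors. The sequences, k = 4 and the function names are all assumptions, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def kmer_graph(sequences, k):
    """Link consecutive k-mers (overlap of k-1 bases) within each sequence."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
            predecessors[right].add(left)
    return successors, predecessors


def branch_points(successors, predecessors):
    """Return k-mers entered from >= 2 k-mers and k-mers leading to >= 2."""
    entries = sorted(n for n, p in predecessors.items() if len(p) >= 2)
    exits = sorted(n for n, s in successors.items() if len(s) >= 2)
    return entries, exits


if __name__ == "__main__":
    # Two hypothetical genomic contexts around the same repeat "GATTACA".
    toy = ["CCGATTACAGG", "TTGATTACAAC"]
    succ, pred = kmer_graph(toy, k=4)
    entries, exits = branch_points(succ, pred)
    print("entered from several paths:", entries)
    print("leaving along several paths:", exits)
```

In this toy graph the shared stretch can be entered from two different k-mers and left toward two different k-mers, so a conservative assembler has no safe way to decide which entry belongs with which exit.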

If you run this again with the 2.0 series (v2.0.0-rc7), there are new useful files for this.


To verify if it is a duplication, you can check the coverage in

BiologicalAbundances/_DeNovoAssembly/*.xml

You will find an entry for any contig in there. These files contain coverage data at each position.
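
A rough sketch of the kind of check this allows, assuming the per-position depths of one contig have already been extracted into a plain list (the XML layout is not assumed here, and the function name is hypothetical). If the depth inside the suspect region is about twice the depth of the rest of the contig, that would suggest the sequence really occurs twice in the genome; a ratio near 1 would suggest a single copy that ended up in two contigs.

```python
from statistics import median


def relative_depth(depths, start, end):
    """Median depth inside [start, end) divided by median depth outside it.

    `depths` is a list of per-position coverage values for one contig,
    assumed to have been read from the coverage files beforehand.
    """
    inside = depths[start:end]
    outside = depths[:start] + depths[end:]
    if not inside or not outside:
        raise ValueError("region must be non-empty and smaller than the contig")
    return median(inside) / median(outside)


if __name__ == "__main__":
    # Synthetic example: first 50 positions at depth 200, the rest at 100.
    synthetic = [200] * 50 + [100] * 150
    print(relative_depth(synthetic, 0, 50))
```

The median is used rather than the mean so that a few extreme positions do not dominate the estimate.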

kmer 23) (Notice there is about 63 Kb difference between the "c 100 bp total len" in the .csv file of kmer 21 and kmer 23). From the blast2seq there is no such 63 Kb duplication found.
Now, once again focusing on the "c 100 bp total len" in the .csv file. I propose kmer 21 duplicate, kmer 23 non-duplicate, kmer 25 non-duplicate, kmer 27 duplicate (I verified this is true),


This is unusual !

kmer 29 non-duplicate. Strangely kmer 31 is missing about 200 Kb in comparison to the other assemblies.

Could you check if this is still true for a 1000X assembly for k = 31 with Ray v2.0.0-rc7?

What we are wondering is:
1) Why are there large, almost identical duplicates present in the assemblies?

In Ray, things are done in parallel. Sometimes, two processor cores will compute the same contigs. But later, Ray will merge things together in a conservative way.

2) Why do we see it in some kmers and not others? (I could understand if lower kmer = duplicates, higher kmers = non-duplicates if these duplicates are propagated by a sequencing error)

There is a lot of combinatorics going on. I don't have a definite answer for that.


But I would be curious to see if this happens with Ray 2.0.0-rc7.

3) Is this a bug or an inherent issue with graph-based assembly?


I think it is just a behavior of the algorithm.

4) If possible, how can we fix this?


Can you try your 1000 X data, raw, with v2.0.0-rc7 ?

5) How can I help you with this?
I hope I have provided you with enough information. If not please let me know.
p.s. I have no problem with this going to the list as long as the attachments are not included.


If you compress your files, you should be fine.


Regards

Mitch

On Fri, May 18, 2012 at 12:25 AM, Sébastien Boisvert <sebastien.boisver...@ulaval.ca> wrote:


    Hi,

    Is the 63 kB perfectly duplicated in your assembly ?


    On 2012-05-17 01:26, Mitchell Stanton-Cook wrote:

        Hi Seb,

        Hope all is well.

        I was wondering if you have ever seen duplicate sequence in
        contigs.

        We have ~63 kB duplicate at the start of two different
        contigs. Beyond this it's unique.

        This is Ray 1.7.

        I'm re-running with Ray2.0rc5.

        I came across these posts (ABySS specific):

        http://groups.google.com/group/abyss-users/browse_thread/thread/264e894b4ec0c96d/30dab75afa686878
        http://groups.google.com/group/abyss-users/browse_thread/thread/7a03ee033b11afc4
        http://groups.google.com/group/abyss-users/browse_thread/thread/f0f3a650bd12cf1e


        Any ideas?

        Regards

        Mitch


_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
