Hi,
From slide 8, I understand that the problem of reads being reused
within the same contig is not observed. Is that right?
You say that the number of contigs >= 100 is 120. What about those >= 500?
If your reads are 101 bp long and some of them are duplicated, then some
of the contigs >= 100 may be nothing more than duplicated reads.
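To compare the two cutoffs, the contigs can be counted at several minimum lengths. The following is a minimal sketch, assuming the contigs are in a plain FASTA file; the function names are illustrative and are not part of Ray.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_by_min_length(lines: Iterable[str],
                        thresholds: Sequence[int] = (100, 500)) -> Dict[int, int]:
    """Count contigs whose length is >= each threshold."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    return {t: sum(1 for n in lengths if n >= t) for t in thresholds}
```

Calling `count_by_min_length(open("Contigs.fasta"))` (the file name is an assumption) returns one count per threshold, so a large gap between the >= 100 and >= 500 counts would point to many read-sized contigs.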
If you get the same number of contigs and scaffolds, it usually means
that the paired information is not sufficient for further scaffolding.
For scaffolding, you may want to try other tools such as the popular SSPACE.
What coverage depth do you have?
Did you try a Ray assembly with your raw data, without any filtering?
Do you think that the missing 90 kb (slide 9) is one linear stretch, or is
it the sum of several missing regions?
For v2.0.0-rc7 with k=17, do the 5 kb sequences occur in the same contig?
Mitchell Stanton-Cook wrote:
Hi Seb,
Following up on our earlier discussion, Nouri (a postdoc in the lab)
has prepared a set of .ppt slides that explain our issues. Hope this
helps to illustrate our concerns.
Regards
Mitch
---------- Forwarded message ----------
From: Nouri BEN ZAKOUR <n.benzak...@uq.edu.au>
Date: Tue, May 22, 2012 at 5:39 PM
Subject: Re: Duplicate sequence in contigs
To: Mitchell Stanton-Cook <m.stantonc...@gmail.com>
=====================================
Nouri BEN ZAKOUR, PhD
Post-doctoral researcher
School of Chemistry & Molecular Biosciences
University of Queensland
Brisbane, QLD 4072, Australia
Tel: +61 7 33653691
Fax: +61 7 33654699
=====================================
On 22/05/2012, at 1:34 PM, Mitchell Stanton-Cook wrote:
---------- Forwarded message ----------
From: Sébastien Boisvert <sebastien.boisver...@ulaval.ca>
Date: Sat, May 19, 2012 at 1:27 AM
Subject: Re: Duplicate sequence in contigs
To: Mitchell Stanton-Cook <m.stantonc...@gmail.com>,
denovoassembler-users@lists.sourceforge.net
Hi,
The contig is co-linear with itself!
I think that running the same assembly with v2.0.0-rc7 and then checking
the coverage readouts in RayOutput/BiologicalAbundances/_DeNovoAssembly
will shed light on this.
Also, I saw that you used a k-mer length of 33, so you must have
compiled with MAXKMERLENGTH=64 or something like that. Some users have
had consistency problems when changing MAXKMERLENGTH from the default,
which is 32. Maybe this is related. MAXKMERLENGTH=64 does not crash, but
there may still be some bugs. I think I have fixed a few of these bugs
since v1.7.
v1.7 was post-Assemblathon. v2.0.0 has many changes in it, aside
from the new RayPlatform.
Sébastien
On 2012-05-18 03:46, Mitchell Stanton-Cook wrote:
Hi Seb,
Sorry about the bombardment.
I thought you would like this. It's a self-self dot plot of the
"pseudomolecule" of the concatenated contigs.
Regards
Mitch
On Fri, May 18, 2012 at 3:46 PM, Mitchell Stanton-Cook
<m.stantonc...@gmail.com> wrote:
Hi Seb,
We have looked at a few more things while you have been sleeping ;-)
We re-ran the assemblies to get the .afg files and loaded them
into hawkeye.
It appears that somehow Ray duplicates reads. That is, there are
more reads in the contigs than actually went into the assembly
(TotalReads < ReadsInContigs).
The difference is equal to the SingletonReads value (which is
negative!).
Somewhere Ray is artificially duplicating (certain?) reads.
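The accounting above can be sketched as follows. This is only an illustration under two assumptions: that SingletonReads is computed as TotalReads minus ReadsInContigs (which is what the observed values suggest), and that the read identifiers placed in contigs have already been extracted into a list. The function name and the example identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_placements(total_reads: int,
                         placed_read_ids: Iterable[str]) -> Dict[str, object]:
    """Summarize how many read placements the contigs contain.

    placed_read_ids holds one entry per placement, so a read that is
    placed twice appears twice.
    """
    counts = Counter(placed_read_ids)
    reads_in_contigs = sum(counts.values())
    return {
        "reads_in_contigs": reads_in_contigs,
        # Assumed definition; negative when placements exceed input reads.
        "singleton_reads": total_reads - reads_in_contigs,
        # Reads that occur in more than one placement.
        "multiply_placed": {rid: n for rid, n in counts.items() if n > 1},
    }
```

A negative `singleton_reads` together with a non-empty `multiply_placed` would be consistent with the same reads being placed more than once.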
I did a quick idiot check at my end (although I don't think it
matters) to make sure we did not send duplicate reads into Ray
(i.e. identical 4-line fastq blocks) and that is all OK.
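A check of this kind can be sketched as below. It assumes a strict 4-line-per-record FASTQ file with no wrapped sequence lines, and it holds every record in memory, so it is only a sketch for moderately sized inputs. The function names are illustrative.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]


def iter_fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield each 4-line FASTQ block as a tuple of stripped lines."""
    it = iter(lines)
    while True:
        block = [line.rstrip("\n") for line in islice(it, 4)]
        if not block:
            return
        if len(block) != 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield (block[0], block[1], block[2], block[3])


def duplicate_records(lines: Iterable[str]) -> Dict[Record, int]:
    """Return records (header, sequence, '+', qualities) seen more than once."""
    counts = Counter(iter_fastq_records(lines))
    return {rec: n for rec, n in counts.items() if n > 1}
```

An empty result from `duplicate_records(open("XXXX.fastq"))` means no identical 4-line block occurs twice in the input.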
We have also noticed that in some contigs there appear to be
stretches of 0 coverage... We're not too sure what's going on
here.
Hope this is of use.
Cheers
Mitch
On Fri, May 18, 2012 at 12:32 PM, Mitchell Stanton-Cook
<m.stantonc...@gmail.com> wrote:
Hi Seb,
Thanks for the reply.
A bit more background:
-------------------------------
This is 100 bp Illumina PE data (insert size of ~300 bp, s.d. of
about 10%).
The coverage is ~1000X. Ray (1.7) did not like this high coverage
(you have previously commented on the user list about this).
We sampled this down to ~100X coverage.
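For reference, going from ~1000X to ~100X means keeping roughly one read pair in ten. The sketch below shows one way to do that while keeping mates together; it is not necessarily the procedure we used, and the function names and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, TypeVar

Pair = TypeVar("Pair")


def sampling_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of read pairs to keep to reach the target coverage."""
    if current_coverage <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


def subsample_pairs(pairs: Iterable[Pair], fraction: float,
                    seed: int = 0) -> Iterator[Pair]:
    """Keep each pair independently with the given probability.

    Both mates are kept or dropped together, so no single-end reads
    are created by the sampling.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    for pair in pairs:
        if rng.random() < fraction:
            yield pair
```

Here `sampling_fraction(1000, 100)` gives 0.1, and the resulting coverage is only approximately 100X because each pair is sampled independently.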
We also cleaned the reads (you have pointed out that this is not
necessary, but having a consistent cleaned set makes
downstream analysis, e.g. SNP calling, much easier). All input
bases have a Q score >= 30. Only reads > 70 bp after trimming were
kept. After this the mean read size is ~97 bp. We only
consider read pairs (if one of the sequences in a pair
fails a cleaning criterion, both do) and hence no single-end
reads go into Ray.
Ray was executed like this (we used Ray's internal
estimations/calculations to determine the best parameters):
mpiexec -n $PROC Ray -i XXXX.fastq -k $K -o $OUT
We looked at kmers from 15-35 in increments of 2.
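The sweep can be written out as one command per k value. The sketch below only builds the command lines and does not run them; it reuses the flags from the command above (-n, -i, -k, -o), while the function name and the output directory prefix are hypothetical.

```python
from typing import List


def ray_commands(reads: str, nproc: int, k_min: int = 15, k_max: int = 35,
                 step: int = 2, out_prefix: str = "ray_k") -> List[List[str]]:
    """Build one mpiexec/Ray argument list per k-mer length."""
    commands = []
    for k in range(k_min, k_max + 1, step):
        commands.append([
            "mpiexec", "-n", str(nproc), "Ray",
            "-i", reads,
            "-k", str(k),
            "-o", f"{out_prefix}{k}",  # hypothetical per-k output directory
        ])
    return commands
```

For example, `ray_commands("XXXX.fastq", nproc=8)` yields 11 commands, for k = 15, 17, ..., 35, each of which could be passed to `subprocess.run`.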
The genome is ~1.8 Mb. From the assemblies it appears there
are not a lot of repetitive elements in the genome. I have
attached a csv of the results (XXXX_ALL.csv)
(abbreviations: c = contigs, s = scaffolds, N = scaffolding
character).
Results for kmer 21 and kmer 23 are interesting.
kmer 21) As previously mentioned, we have a 63 Kb
duplication. For 63 Kb, the starts of the two contigs are
almost identical (3 mismatches and 1 gap):
Score = 1.165e+05 bits (63086), Expect = 0.0
Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
Strand=Plus/Plus
I have attached the alignment (XXXX_b2s.aln)
There is also a smaller 5 Kb duplication detected:
Score = 9583 bits (5189), Expect = 0.0
Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)
Strand=Plus/Minus
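Because BLAST reports both hits as 99%, the exact figures are easier to read when recomputed from the raw counts. This is a small sketch, assuming the Identities denominator is the alignment length with gap columns included; the function name is illustrative.

```python
from typing import Dict


def alignment_summary(identities: int, aligned: int, gaps: int) -> Dict[str, float]:
    """Recompute identity and differing columns from BLAST summary counts."""
    if aligned <= 0:
        raise ValueError("alignment length must be positive")
    differing = aligned - identities
    return {
        "percent_identity": 100.0 * identities / aligned,
        "differing_columns": differing,
        "gap_columns": gaps,
        # Non-identical columns that are not gap columns.
        "mismatch_columns": differing - gaps,
    }
```

With the values quoted above, `alignment_summary(63093, 63096, 1)` gives 3 non-identical columns for the 63 Kb hit, 1 of which is a gap column, and `alignment_summary(5192, 5193, 1)` gives a single non-identical column for the 5 Kb hit, which is the gap.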
kmer 23) (Notice there is about 63 Kb difference between the
"c 100 bp total len" in the .csv file of kmer 21 and kmer
23). From the blast2seq there is no such 63 Kb duplication
found.
Now, once again focusing on the "c 100 bp total len" in the
.csv file, I propose: kmer 21 duplicate, kmer 23
non-duplicate, kmer 25 non-duplicate, kmer 27 duplicate (I
verified this is true), kmer 29 non-duplicate. Strangely, kmer
31 is missing about 200 Kb in comparison to the other
assemblies.
What we are wondering is:
1) Why are there large, almost identical duplicates present
in the assemblies?
2) Why do we see them for some kmers and not others? (I could
understand if lower kmer = duplicates, higher kmers =
non-duplicates, if these duplicates are propagated by a
sequencing error.)
3) Is this a bug or an inherent issue with graph-based assembly?
4) If possible, how can we fix this?
5) How can I help you with this?
I hope I have provided you with enough information. If not
please let me know.
P.S. I have no problem with this going to the list as long
as the attachments are not included.
Regards
Mitch
On Fri, May 18, 2012 at 12:25 AM, Sébastien Boisvert
<sebastien.boisver...@ulaval.ca> wrote:
Hi,
Is the 63 kB perfectly duplicated in your assembly?
On 2012-05-17 01:26, Mitchell Stanton-Cook wrote:
Hi Seb,
Hope all is well.
I was wondering if you have ever seen duplicate
sequence in contigs.
We have ~63 kB duplicate at the start of two
different contigs. Beyond this it's unique.
This is Ray 1.7.
I'm re-running with Ray2.0rc5.
I came across these posts (ABySS specific):
http://groups.google.com/group/abyss-users/browse_thread/thread/264e894b4ec0c96d/30dab75afa686878
http://groups.google.com/group/abyss-users/browse_thread/thread/7a03ee033b11afc4
http://groups.google.com/group/abyss-users/browse_thread/thread/f0f3a650bd12cf1e
Any ideas?
Regards
Mitch
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users