On 2012-05-18 01:46, Mitchell Stanton-Cook wrote:
Hi Seb,
We have looked at a few more things while you have been sleeping ;-)
We re-ran the assemblies to get the .afg files and loaded them into
hawkeye.
It appears that somehow Ray duplicates reads. That is, there are more
reads in the contigs than actually went into the assembly.
Velvet also has this feature.
Newbler can also split one read in two contigs.
(TotalReads < ReadsInContigs)
The difference is equal to the SingletonReads value (which is negative!)
Somewhere Ray is artificially duplicating (certain?) reads.
I think all de Bruijn assemblers will have some of the reads in more
than 1 contig.
But for Ray, no read should occur twice in the same contig, in
theory.
But it can occur in many contigs.
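
To make the accounting above concrete, here is a minimal sketch (not Ray's own bookkeeping) of how ReadsInContigs can exceed TotalReads when a read is placed in more than one contig. The dictionary layout and all names are hypothetical; in practice the placements would have to be parsed from the .afg file.

```python
from collections import Counter

def placement_summary(contig_reads, total_reads):
    """Summarise read placements across contigs.

    contig_reads: dict mapping contig name -> list of read IDs placed in it
                  (hypothetical layout, e.g. parsed from an .afg file).
    total_reads:  number of reads given to the assembler.
    """
    # Count how many times each read ID is placed, over all contigs.
    placements = Counter(r for reads in contig_reads.values() for r in reads)
    reads_in_contigs = sum(placements.values())  # every placement counted
    distinct_placed = len(placements)            # each read counted once
    return {
        "reads_in_contigs": reads_in_contigs,
        "distinct_reads_placed": distinct_placed,
        # Goes negative as soon as reads are placed more than once.
        "naive_singletons": total_reads - reads_in_contigs,
        "unplaced_reads": total_reads - distinct_placed,
        "multi_placed": sorted(r for r, n in placements.items() if n > 1),
    }
```

If the negative SingletonReads value comes from subtracting placements rather than distinct reads, counting distinct read IDs should give a non-negative number of unplaced reads.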
I did a quick idiot check at my end (although I don't think it
matters) to make sure we did not send duplicate reads into Ray (i.e.
identical 4-line FASTQ block) and that is all OK.
OK.
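
For reference, a check for identical 4-line FASTQ blocks could look roughly like the sketch below. It assumes strict 4-line records (no wrapped sequence lines) and holds every record in memory, so it is only meant for a quick sanity check; the function name is made up for illustration.

```python
from collections import Counter

def duplicate_fastq_records(lines):
    """Return {record: count} for 4-line FASTQ records seen more than once.

    lines: an iterable of text lines, e.g. an open file handle.
    Assumes strict 4-line records (header, sequence, '+', qualities).
    """
    stripped = [line.rstrip("\n") for line in lines]
    # Group consecutive lines into 4-tuples; an incomplete trailing
    # record is ignored.
    records = Counter(
        tuple(stripped[i:i + 4]) for i in range(0, len(stripped) - 3, 4)
    )
    return {rec: n for rec, n in records.items() if n > 1}
```

Usage would be something like `duplicate_fastq_records(open("XXXX.fastq"))`; an empty dictionary means no identical blocks were found.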
We have also noticed that in some contigs there appear to be
stretches of 0 coverage... We're not too sure what's going on here.
In Hawkeye?
Do you also see this with Tablet?
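
One way to compare viewers is to list the zero-coverage stretches from a per-base depth vector and check whether the same intervals show up in each tool. The sketch below assumes the depth values are already available as a list of integers (one per contig position); how they are exported from Hawkeye or Tablet is not covered here.

```python
def zero_coverage_runs(depth, min_len=1):
    """Return half-open (start, end) intervals where depth is 0.

    depth:   list of per-base coverage values for one contig (assumed input).
    min_len: only report runs at least this long.
    """
    runs = []
    start = None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i                      # a zero-coverage run begins
        elif d != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))    # run ended just before position i
            start = None
    # Close a run that extends to the end of the contig.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs
```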
Hope this is of use.
Cheers
Mitch
On Fri, May 18, 2012 at 12:32 PM, Mitchell Stanton-Cook
<m.stantonc...@gmail.com> wrote:
Hi Seb,
Thanks for the reply.
A bit more background:
-------------------------------
This is 100 bp Illumina PE data (insert size ~300 bp, s.d. ~10%).
~ 1000X coverage. Ray (1.7) did not like this high coverage (you
have previously commented on the user-list about this). We sampled
this down to ~100X coverage.
We also cleaned the reads (you have pointed out this is not necessary,
but having a consistent cleaned set makes downstream analysis, e.g.
SNP calling, much easier). All input bases have a Q score >= 30.
Reads > 70 bp after trimming were filtered. After this the mean
read size is ~97 bp. We only consider read pairs (if one of the
sequences in a pair fails a cleaning criterion, both do) and
hence no single-end reads go into Ray.
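
As a rough illustration of the down-sampling arithmetic, coverage is approximately (number of reads x mean read length) / genome size. The sketch below uses the figures quoted in this thread (~1.8 Mb genome, ~97 bp mean read length) and keeps read pairs together when sampling; the function names and the in-memory list of pairs are assumptions for illustration, not the procedure actually used.

```python
import random

def reads_for_coverage(target_cov, genome_size, mean_read_len):
    """Approximate number of reads needed to reach a target coverage."""
    return round(target_cov * genome_size / mean_read_len)

def subsample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    pairs: list of (read1, read2) tuples, so mates are never separated.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return [p for p in pairs if rng.random() < fraction]

# ~100X of a ~1.8 Mb genome with ~97 bp reads:
n_reads = reads_for_coverage(100, 1_800_000, 97)
# Going from ~1000X to ~100X means keeping about 1 pair in 10.
fraction = 100 / 1000
```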
Ray was executed like this (we used Ray's internal
estimations/calculations to determine the best parameters):
mpiexec -n $PROC Ray -i XXXX.fastq -k $K -o $OUT
We looked at kmers from 15-35 in increments of 2.
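
The k-mer sweep can be sketched as below. It only builds the command lines, using the same options as the mpiexec call quoted above (-n, -i, -k, -o); the process count and the output naming scheme are placeholders, not the values we actually used.

```python
def ray_commands(fastq, n_proc, k_min=15, k_max=35, step=2, out_prefix="ray_k"):
    """Build one Ray command (as an argument list) per k-mer length.

    out_prefix is a hypothetical naming scheme for the -o output directory.
    """
    cmds = []
    for k in range(k_min, k_max + 1, step):  # 15, 17, ..., 35
        cmds.append([
            "mpiexec", "-n", str(n_proc), "Ray",
            "-i", fastq, "-k", str(k), "-o", f"{out_prefix}{k}",
        ])
    return cmds

for cmd in ray_commands("XXXX.fastq", n_proc=8):
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run` to actually launch the assembly.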
The genome is ~1.8 Mb. From the assemblies it appears there are
not a lot of repetitive elements in the genome. I have attached a csv
of the results (XXXX_ALL.csv) (abbreviations: c = contigs, s =
scaffold, N = scaffolding character).
Results for kmer 21 and kmer 23 are interesting.
kmer 21) As previously mentioned we have a 63 Kb duplication. For
63 Kb the starts of the two contigs are almost identical (3
differences, including 1 gap):
Score = 1.165e+05 bits (63086), Expect = 0.0
Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
Strand=Plus/Plus
I have attached the alignment (XXXX_b2s.aln)
There is also a smaller 5 Kb duplication detected:
Score = 9583 bits (5189), Expect = 0.0
Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)
Strand=Plus/Minus
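
Because BLAST reports the identity as a whole-number percentage, both hits show 99%. A small sketch of the unrounded figures follows, assuming the usual BLAST convention that the denominator is the alignment length and that gap columns are never counted as identities; the function name is made up for illustration.

```python
def alignment_stats(identities, aln_len, gaps):
    """Derive percent identity and mismatch count from a BLAST summary line.

    Assumes: aln_len counts all alignment columns, and every column is
    either an identity, a gap, or a mismatch.
    """
    return {
        "percent_identity": 100.0 * identities / aln_len,
        "gaps": gaps,
        "mismatches": aln_len - identities - gaps,
    }

big = alignment_stats(63093, 63096, 1)    # the 63 Kb hit
small = alignment_stats(5192, 5193, 1)    # the 5 Kb hit
print(f"{big['percent_identity']:.4f}%  mismatches={big['mismatches']}")
print(f"{small['percent_identity']:.4f}%  mismatches={small['mismatches']}")
```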
kmer 23) (Notice there is about 63 Kb difference between the "c
100 bp total len" in the .csv file of kmer 21 and kmer 23). From
the blast2seq there is no such 63 Kb duplication found.
Now, once again focusing on the "c 100 bp total len" in the .csv
file. I propose kmer 21 duplicate, kmer 23 non-duplicate, kmer 25
non-duplicate, kmer 27 duplicate (I verified this is true), kmer 29
non-duplicate. Strangely, kmer 31 is missing about 200 Kb in
comparison to the other assemblies.
What we are wondering is:
1) Why are there large, almost identical duplicates present in the
assemblies?
2) Why do we see it in some kmers and not others? (I could
understand if lower kmer = duplicates, higher kmers =
non-duplicates if these duplicates are propagated by a sequencing
error)
3) Is this a bug or an inherent issue with graph-based assembly?
4) If possible, how can we fix this?
5) How can I help you with this?
I hope I have provided you with enough information. If not please
let me know.
P.S. I have no problem with this going to the list as long as
the attachments are not included.
Regards
Mitch
On Fri, May 18, 2012 at 12:25 AM, Sébastien Boisvert
<sebastien.boisver...@ulaval.ca> wrote:
Hi,
Is the 63 kb region perfectly duplicated in your assembly?
On 2012-05-17 01:26, Mitchell Stanton-Cook wrote:
Hi Seb,
Hope all is well.
I was wondering if you have ever seen duplicate sequence
in contigs.
We have ~63 kB duplicate at the start of two different
contigs. Beyond this it's unique.
This is Ray 1.7.
I'm re-running with Ray2.0rc5.
I came across these posts (ABySS specific):
http://groups.google.com/group/abyss-users/browse_thread/thread/264e894b4ec0c96d/30dab75afa686878
http://groups.google.com/group/abyss-users/browse_thread/thread/7a03ee033b11afc4
http://groups.google.com/group/abyss-users/browse_thread/thread/f0f3a650bd12cf1e
Any ideas?
Regards
Mitch
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users