Hello,

> Hello everyone,
>
> I'm attempting to assemble a genome using data from 2 lanes of the Illumina
> HiSeq, totaling ~250M 104 bp paired end reads with an insert size of ~450bp.
> We estimate the genome size just under 2Gbp. This would roughly compute to
> 30x coverage assuming it all maps to our genome.
>

250 M 104-bp paired-end reads give a raw coverage of 13, not 30.

irb(main):003:0> rawBases=250*1000*1000*104
=> 26000000000
irb(main):004:0> genomeBases=2*1000*1000*1000
=> 2000000000
irb(main):005:0> rawCoverage=(rawBases+0.0)/genomeBases
=> 13.0

Because of sequencing errors, the usable k-mer coverage will be lower
than 13.

The first thing you want to do is a Ray quality assurance run with k=21.
If everything runs fine, Ray should detect a peak and the average insert
size for your libraries. If that works, you can then start playing with
the k-mer length.

> I'm attempting to use Ray (v1.6.1-rc3) and am struggling to find a setting
> that both proves to finish in reasonable amount
> of time and constructs a reasonable assembly. I've noticed under the default
> settings the minimum kmer coverage is set
> to one less than the peak coverage (which does not appear to be the same as
> the max coverage

The minimum, peak and repeat coverages are detected from the coverage
distribution (PREFIX.CoverageDistribution.txt). The minimum k-mer
coverage is not set to one less than the peak coverage.

Besides, v1.6.1 is now on SourceForge.

> and is often in 500-700 range) and this leads to the exclusion of far too
> many Kmers (or so it appears), and assembles an
> awful genome, with n50 in the 100's.
>

Can you post PREFIX.CoverageDistribution.txt with k=21 on
http://pastebin.com/ and link it in your next email?

Example: http://pastebin.com/BJRYwzBZ

The corresponding CoverageDistributionAnalysis.txt file:

k-mer length: 31
Lowest coverage observed: 1
MinimumCoverage: 42
PeakCoverage: 171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453478644 k-mers
Estimated genome length: 1226739322 nucleotides
Percentage of vertices with coverage 1: 83.7745 %
DistributionFile: parrot-Testbed-A2-k31-20110719-9c8b02dbd.CoverageDistribution.txt

So, you can see that Ray finds the peak coverage automatically when
there is one.

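On the earlier remark that the usable k-mer coverage will be lower than 13: apart from sequencing errors, read length alone already reduces it, because a read of length L contains only L - k + 1 k-mers. The sketch below applies that standard relation to the numbers in this thread. The helper name kmer_coverage is made up for illustration and is not part of Ray. It assumes error-free reads, so real values will be lower still.

```ruby
# Upper bound on k-mer coverage for error-free reads:
# each read of length L yields (L - k + 1) k-mers,
# so C_k = C * (L - k + 1) / L.
# kmer_coverage is a hypothetical helper, not a Ray function.
def kmer_coverage(raw_coverage, read_length, k)
  unless k.between?(1, read_length)
    raise ArgumentError, "k must be between 1 and the read length"
  end
  raw_coverage * (read_length - k + 1).to_f / read_length
end

# Raw coverage from the thread: 250 M reads x 104 bp over a 2 Gbp genome.
raw_coverage = 250_000_000 * 104 / 2_000_000_000.0

[21, 31, 61].each do |k|
  bound = kmer_coverage(raw_coverage, 104, k)
  puts format("k=%d  k-mer coverage at most %.2f", k, bound)
end
```

The bound drops quickly as k grows (10.5 at k=21, 5.5 at k=61), which is one reason to start with a small k when the raw coverage is this low.
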
> Below is my Kmer distribution:
>

Why do you only have powers of 2? This was not generated by Ray; Ray
measures coverage values from 1 to 65535.

> Kmer coverage bin   Frequency (k=61)   Frequencey (k=31)
> 2                   11422              12689
> 4                   3461               5764
> 8                   2570               5380
> 16                  2191               4239
> 32                  1753               3382
> 64                  1130               2386
> 128                 923                1804
> 256                 727                1308
> 512                 491                954
> 1024                345                684
> 2048                269                487
> 4096                199                375
> 8192                159                260
> 16384               111                188
> 32768               75                 137
> 65536               55                 92
> 131072              40                 67
> 262144              28                 46
> 524288              20                 33
> 1048576             14                 24
> 2097152             10                 16
> 4194304             6                  11
> 8388608             4                  5
> 16777216            3                  4
> 33554432            2                  4
> 67108864            2                  3
> 134217728           2                  4
> 268435456           2                  5
> 536870912           3                  7
> 1073741824          3                  1
> 2147483648          1                  0
> 4294967296          0                  0
> 8589934592          1                  1
>
> Does anyone have suggestions for Kmer values and coverage minimums to set?
>

Well, a good start for quality assurance is to run Ray with -k 21 (the
default value) and look in PREFIX.CoverageDistribution.txt and
PREFIX.CoverageDistributionAnalysis.txt to see whether a peak was
detected.

For Illumina data, Ray should do well, but I don't know if a raw
coverage of 13 is enough. Statistically, you will get a lot of uncovered
regions according to the Lander-Waterman statistics.

http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory

> Thanks for your help,
>
> Walter
>

Best.

Sébastien
http://github.com/sebhtml

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
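
To put rough numbers on the Lander-Waterman remark, here is a small sketch of the classic estimates: the chance that a base is left uncovered is about exp(-c), and the expected number of islands (contigs) is about N * exp(-c * sigma), with sigma = 1 - T/L. The function name and the choice of minimum overlap T = 21 are assumptions for illustration only. The model is idealized (uniform random reads, no errors, no repeats), so it is optimistic for real data.

```ruby
# Idealized Lander-Waterman estimates. Assumes uniformly sampled,
# error-free reads and a repeat-free genome.
# lander_waterman and min_overlap: 21 are illustrative assumptions.
def lander_waterman(num_reads:, read_length:, genome_length:, min_overlap:)
  coverage = num_reads * read_length.to_f / genome_length
  # sigma: fraction of a read that lies outside the required overlap
  sigma = 1.0 - min_overlap.to_f / read_length
  {
    coverage: coverage,
    uncovered_fraction: Math.exp(-coverage),
    uncovered_bases: genome_length * Math.exp(-coverage),
    expected_islands: num_reads * Math.exp(-coverage * sigma)
  }
end

est = lander_waterman(num_reads: 250_000_000, read_length: 104,
                      genome_length: 2_000_000_000, min_overlap: 21)
est.each { |name, value| puts format("%-20s %.6g", name, value) }
```

Even in this best case the assembly breaks into thousands of islands at 13x. Sequencing errors and the lower effective k-mer coverage make the real situation worse.
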