Hello,

> Hello everyone,
> 
> I'm attempting to assemble a genome using data from 2 lanes of the Illumina 
> HiSeq, totaling ~250M 104 bp paired end reads with an insert size of ~450bp.  
> We estimate the genome size just under 2Gbp.  This would roughly compute to 
> 30x coverage assuming it all maps to our genome.
> 

250 M reads of 104 bp give a raw coverage of 13x, not 30x.

irb(main):003:0> rawBases=250*1000*1000*104
=> 26000000000
irb(main):004:0> genomeBases=2*1000*1000*1000
=> 2000000000
irb(main):005:0> rawCoverage=(rawBases+0.0)/genomeBases
=> 13.0


Because of sequencing errors, the usable k-mer coverage will be lower than 13.
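
As a rough sketch of why k-mer coverage is lower than raw coverage even before
errors are considered: a read of length L contains only L - k + 1 k-mers, so
the expected k-mer coverage is the raw coverage scaled by (L - k + 1) / L.
The function name below is made up for illustration, and the model ignores
sequencing errors, which reduce the usable coverage further.

```ruby
# Expected k-mer coverage from raw coverage, read length and k.
# Idealized: assumes error-free reads sampled uniformly.
def kmer_coverage(raw_coverage, read_length, k)
  raise ArgumentError, "k must not exceed the read length" if k > read_length
  raw_coverage * (read_length - k + 1).to_f / read_length
end

raw_coverage = 13.0  # raw coverage computed in the irb session above
read_length  = 104   # read length from the original question

[21, 31, 61].each do |k|
  puts format("k=%d: k-mer coverage ~ %.2f", k,
              kmer_coverage(raw_coverage, read_length, k))
end
# prints:
# k=21: k-mer coverage ~ 10.50
# k=31: k-mer coverage ~ 9.25
# k=61: k-mer coverage ~ 5.50
```

Under this model, larger k values leave less coverage per k-mer, which is one
reason to start with a small k on a low-coverage dataset.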

The first thing you want to do is a Ray quality assurance run with k=21.

If everything runs fine, Ray should detect a peak and the average insert size 
for your libraries.

Then if that works, you can start playing with the k-mer length.

> I'm attempting to use Ray (v1.6.1-rc3) and am struggling to find a setting 
> that both proves to finish in reasonable amount
> of time and constructs a reasonable assembly.  I've noticed under the default 
> settings the minimum kmer coverage is set 
> to one less than the peak coverage (which does not appear to be the same as 
> the max coverage

Minimum, peak and repeat coverages are detected in the coverage distribution 
(PREFIX.CoverageDistribution.txt).
The minimum k-mer coverage is not set to one less than peak coverage.
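
To illustrate what "minimum" and "peak" mean in a coverage distribution, here
is a toy heuristic. It is NOT Ray's implementation; the function name, the
input layout (pairs of coverage and count) and the sample numbers are all
assumptions made for this sketch. It walks down the initial slope of
error k-mers to the valley, then takes the highest count after the valley.

```ruby
# Toy heuristic, not Ray's algorithm.
# histogram: array of [coverage, count] pairs sorted by coverage.
# Returns { minimum:, peak: } or nil when the counts never rise again.
def find_minimum_and_peak(histogram)
  # Walk down the error slope until the counts stop decreasing.
  i = 0
  i += 1 while i + 1 < histogram.size && histogram[i + 1][1] < histogram[i][1]
  minimum = histogram[i]

  # The peak is the highest count at or after the valley.
  peak = histogram[i..-1].max_by { |_coverage, count| count }
  return nil if peak[0] == minimum[0] # no rise after the valley: no peak

  { minimum: minimum[0], peak: peak[0] }
end

# Made-up distribution with an error slope, a valley at 4 and a peak at 7.
toy = [[1, 1000], [2, 400], [3, 100], [4, 50], [5, 80],
       [6, 150], [7, 200], [8, 160], [9, 90]]

result = find_minimum_and_peak(toy)
puts "minimum=#{result[:minimum]} peak=#{result[:peak]}"
# prints: minimum=4 peak=7
```

Real distributions are noisy, so a usable detector would need smoothing; the
sketch only shows why a distribution that decreases monotonically has no peak
to report.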


Also, the final v1.6.1 release is available on SourceForge, so you no longer
need the release candidate.

> and is often in 500-700 range) and this leads to the exclusion of far too 
> many Kmers (or so it appears), and assembles an 
> awful genome, with n50 in the 100's.
> 

Can you post PREFIX.CoverageDistribution.txt with k=21 on http://pastebin.com/ 
and link to it in your next email?

Example:

http://pastebin.com/BJRYwzBZ

The corresponding CoverageDistributionAnalysis.txt file:

k-mer length:   31
Lowest coverage observed:       1
MinimumCoverage:        42
PeakCoverage:   171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453478644 k-mers
Estimated genome length:        1226739322 nucleotides
Percentage of vertices with coverage 1: 83.7745 %
DistributionFile: parrot-Testbed-A2-k31-20110719-9c8b02dbd.CoverageDistribution.txt


So, you can see that Ray finds the peak coverage automatically when there is 
one.
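
If you want to check these values from a script, the sample above can be read
as "key: value" lines. The following is a minimal sketch that assumes every
analysis file follows the layout of this one sample; the parse_analysis helper
is hypothetical and not part of Ray.

```ruby
# Parse "Key:<whitespace>value" lines into a hash of strings.
# Assumes the layout of the sample file shown above.
def parse_analysis(text)
  text.each_line.with_object({}) do |line, fields|
    key, value = line.strip.split(/:\s*/, 2)
    next if key.nil? || value.nil? || value.empty?
    fields[key] = value
  end
end

sample = <<~TEXT
  k-mer length:   31
  Lowest coverage observed:       1
  MinimumCoverage:        42
  PeakCoverage:   171
  RepeatCoverage: 300
  Number of k-mers with at least MinimumCoverage: 2453478644 k-mers
  Estimated genome length:        1226739322 nucleotides
TEXT

fields = parse_analysis(sample)
peak   = fields["PeakCoverage"].to_i
kmers  = fields["Number of k-mers with at least MinimumCoverage"].to_i
genome = fields["Estimated genome length"].to_i

puts "peak coverage: #{peak}"
puts "k-mers per estimated nucleotide: #{kmers.to_f / genome}"
```

In this sample the estimated genome length is exactly half the number of
k-mers with at least MinimumCoverage, which would be consistent with k-mers
being counted on both strands. That is an observation from this one file, not
a documented formula.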

> Below is my Kmer distribution:
> 

Why do you only have powers of 2?
This table was not generated by Ray; Ray measures coverage values from 1 to 
65535.


> Kmer coverage bin     Frequency (k=61)        Frequency (k=31)
> 2     11422   12689
> 4     3461    5764
> 8     2570    5380
> 16    2191    4239
> 32    1753    3382
> 64    1130    2386
> 128   923     1804
> 256   727     1308
> 512   491     954
> 1024  345     684
> 2048  269     487
> 4096  199     375
> 8192  159     260
> 16384         111     188
> 32768         75      137
> 65536         55      92
> 131072        40      67
> 262144        28      46
> 524288        20      33
> 1048576       14      24
> 2097152       10      16
> 4194304       6       11
> 8388608       4       5
> 16777216      3       4
> 33554432      2       4
> 67108864      2       3
> 134217728     2       4
> 268435456     2       5
> 536870912     3       7
> 1073741824    3       1
> 2147483648    1       0
> 4294967296    0       0
> 8589934592    1       1
> 
> Does anyone have suggestions for Kmer values and coverage minimums to set?
> 

Well, a good start for quality assurance is to run Ray with -k 21 (the default 
value) and look in PREFIX.CoverageDistribution.txt and 
PREFIX.CoverageDistributionAnalysis.txt to see if a peak was detected. For 
Illumina data, Ray should do well.

But I don't know if a raw coverage of 13 is enough. Statistically, you will get 
a lot of uncovered regions according to Lander-Waterman statistics.

http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory
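
For a feel of the numbers, here is a small sketch of the Lander-Waterman
estimate of the expected number of contigs: N * exp(-c * sigma), where
c = N * L / G is the coverage and sigma = 1 - T / L, with T the minimum
overlap needed to join two reads. The function name and the overlap values
tried below are assumptions chosen for illustration. The model is idealized
(uniform sampling, no sequencing errors, no repeats), so a real assembly can
be expected to be more fragmented than it predicts.

```ruby
# Lander-Waterman expected number of contigs ("islands").
# reads: number of reads N, read_length: L, genome_bases: G,
# min_overlap: T, the overlap required to join two reads.
def expected_contigs(reads, read_length, genome_bases, min_overlap)
  coverage = reads * read_length.to_f / genome_bases
  sigma    = 1.0 - min_overlap.to_f / read_length
  reads * Math.exp(-coverage * sigma)
end

reads        = 250_000_000    # read count from the original question
read_length  = 104
genome_bases = 2_000_000_000  # same 2 Gbp figure as the irb session

# Hypothetical overlap requirements, to show how quickly the count grows.
[20, 30, 60].each do |overlap|
  contigs = expected_contigs(reads, read_length, genome_bases, overlap)
  puts format("min overlap %d: ~%.0f expected contigs", overlap, contigs)
end
```

The expected count rises steeply as the required overlap grows, which mirrors
what happens when a larger k is used on a dataset with low coverage.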


> Thanks for your help,
> 
> Walter
> 

Best.

Sébastien
http://github.com/sebhtml
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
