Re: [Denovoassembler-users] Decreasing assembly quality with increasing kmer values

Rick Westerman Wed, 12 Feb 2014 13:55:27 -0800

   A peak in effective kmer size for a quality assembly is common in almost all 
assembly programs.  I see it in ABySS, in Ray, in SOAPdenovo.  This is not to 
say that you are doing everything correctly -- you may be making an error since 
you are suspiciously peaking out at your memory constraints -- but rather it is 
to say that longer kmers simply may not be useful.  In my current project Ray's 
best assembly is at kmer41.  It is worse at lower kmers and higher kmers.  
There is no hard-and-fast rule.
    
   The above is the important point. Below is a long winded explanation which I 
am sure that other people could do much better than I.


   I recall reading in the Ray manual that longer kmers are simply too 
divergent to be useful but can not find that passage at the moment.  Let's see 
if I can explain -- if nothing else the practice is good for me.  Basically for 
any given kmer the assembly programs are looking for a peak number of reads 
versus occurrence.  This is roughly the depth of coverage and will be used for 
further refinement.  As an example, for kmer41 I might get:

kmers occurring 1 time:      20   [basically noise; in the ideal world]
kmers occurring 2 times:     30
kmers occurring 3 times:     20
...
kmers occurring 20 times:   300
kmers occurring 21 times: 2,500
kmers occurring 22 times: 9,000
kmers occurring 23 times: 7,500
kmers occurring 24 times: 1,500
...

Thus my peak is about 22 times.   For a genome assembly program it is assumed 
that the entire genome will be at this coverage.    The assembly program can 
then throw away -- or treat differently -- the parts of reads that have kmers 
in the low occurrence category (these are due to sequencing error or random 
noise) as well as those with high occurrence (these are repeat areas).   In the 
above example the assembly program might discards reads with kmers outside the 
'10 to 40' range.

The above concept works well if there are no sequencing errors.  But of course 
there are thus kmers with errors become unique and we see the initial part of 
the occurrence list look more like :

kmers occurring 1 time:  10,000  
kmers occurring 2 times:  3,000
kmers occurring 3 times:    500
...

As we increase the kmer length then the number of kmers with an error in them, 
and thus unique, increases. More and more kmers get put into the 'occurring 1 
time' (or 2 times) category and are thus discarded or at least not used in the 
initial stages of assembly.  Additionally the peak gets flattened and harder to 
determine.   Thus we want want to balance kmer size with the need not to have 
too short of a kmer.  

As an example, let's say that sequencing error is 1 in 100 bases (with some 
distribution function).  If read sizes are 100 bases and the kmer size is 31 
then There will be 69 kmers from this reads. What is the chance that any given 
kmer will have an error in it and thus be unique?  It does depend on the 
distribution but we can say (very!) roughly about 25% (if I have that right).   
How about if the kmer size is 90?  Then roughly 90% of the kmers will have an 
error and thus be unique.  At this point we might as well not try to assemble 
anything. 

Bottom line: longer kmers are not always better.  


----- Original Message -----
> Just wondering if you might be able to offer some insight as to what
> might be going with our runs of ray. We are assembly a large
> eukaryotic genome and have been using both Ray and Celera to do so.
> Initially, Ray gave us some great results and increasing kmer values
> up to 31-kmer readily improved our assemblies. We then ran into a
> memory wall and got some folks to help us out running things off site.
> We were able to run a 33-kmer run on the same set, but it looked worse
> than the 31-kmer run. Additionally, longer runs of say 61-kmers have
> looked even worse. Genome should be around 30-40x coverage, I think,
> with Hi-seq and Mi-seq data (2x150 and 2x250 paired-end libraries,
> each heavily trimmed to get rid of anything suspicious) and we thought
> we'd still have some room to increase kmer sizes >31 to improve the
> assembly. Any thoughts on what might be going on here? Did we possibly
> compile the program incorrectly for larger kmer sizes? Ever see
> something similar?
> 
> Thanks in advance for any thoughts on the matter.
> 
> Cheers,
> Nate

-- 
Rick Westerman 
[email protected]

Bioinformatics specialist at the Genomics Facility.
Phone: (765) 494-0505           FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building

------------------------------------------------------------------------------
Android apps run on BlackBerry 10
Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
Now with support for Jelly Bean, Bluetooth, Mapview and more.
Get your Android app in front of a whole new audience.  Start now.
http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Re: [Denovoassembler-users] Decreasing assembly quality with increasing kmer values

Reply via email to