A peak in effective kmer size for a quality assembly is common in almost all
assembly programs. I see it in ABySS, in Ray, in SOAPdenovo. This is not to
say that you are doing everything correctly -- you may be making an error since
you are suspiciously peaking out at your memory constraints -- but rather it is
to say that longer kmers simply may not be useful. In my current project Ray's
best assembly is at kmer41. It is worse at lower kmers and higher kmers.
There is no hard-and-fast rule.
The above is the important point. Below is a long winded explanation which I
am sure that other people could do much better than I.
I recall reading in the Ray manual that longer kmers are simply too
divergent to be useful but can not find that passage at the moment. Let's see
if I can explain -- if nothing else the practice is good for me. Basically for
any given kmer the assembly programs are looking for a peak number of reads
versus occurrence. This is roughly the depth of coverage and will be used for
further refinement. As an example, for kmer41 I might get:
kmers occurring 1 time: 20 [basically noise; in the ideal world]
kmers occurring 2 times: 30
kmers occurring 3 times: 20
...
kmers occurring 20 times: 300
kmers occurring 21 times: 2,500
kmers occurring 22 times: 9,000
kmers occurring 23 times: 7,500
kmers occurring 24 times: 1,500
...
Thus my peak is about 22 times. For a genome assembly program it is assumed
that the entire genome will be at this coverage. The assembly program can
then throw away -- or treat differently -- the parts of reads that have kmers
in the low occurrence category (these are due to sequencing error or random
noise) as well as those with high occurrence (these are repeat areas). In the
above example the assembly program might discards reads with kmers outside the
'10 to 40' range.
The above concept works well if there are no sequencing errors. But of course
there are thus kmers with errors become unique and we see the initial part of
the occurrence list look more like :
kmers occurring 1 time: 10,000
kmers occurring 2 times: 3,000
kmers occurring 3 times: 500
...
As we increase the kmer length then the number of kmers with an error in them,
and thus unique, increases. More and more kmers get put into the 'occurring 1
time' (or 2 times) category and are thus discarded or at least not used in the
initial stages of assembly. Additionally the peak gets flattened and harder to
determine. Thus we want want to balance kmer size with the need not to have
too short of a kmer.
As an example, let's say that sequencing error is 1 in 100 bases (with some
distribution function). If read sizes are 100 bases and the kmer size is 31
then There will be 69 kmers from this reads. What is the chance that any given
kmer will have an error in it and thus be unique? It does depend on the
distribution but we can say (very!) roughly about 25% (if I have that right).
How about if the kmer size is 90? Then roughly 90% of the kmers will have an
error and thus be unique. At this point we might as well not try to assemble
anything.
Bottom line: longer kmers are not always better.
----- Original Message -----
> Just wondering if you might be able to offer some insight as to what
> might be going with our runs of ray. We are assembly a large
> eukaryotic genome and have been using both Ray and Celera to do so.
> Initially, Ray gave us some great results and increasing kmer values
> up to 31-kmer readily improved our assemblies. We then ran into a
> memory wall and got some folks to help us out running things off site.
> We were able to run a 33-kmer run on the same set, but it looked worse
> than the 31-kmer run. Additionally, longer runs of say 61-kmers have
> looked even worse. Genome should be around 30-40x coverage, I think,
> with Hi-seq and Mi-seq data (2x150 and 2x250 paired-end libraries,
> each heavily trimmed to get rid of anything suspicious) and we thought
> we'd still have some room to increase kmer sizes >31 to improve the
> assembly. Any thoughts on what might be going on here? Did we possibly
> compile the program incorrectly for larger kmer sizes? Ever see
> something similar?
>
> Thanks in advance for any thoughts on the matter.
>
> Cheers,
> Nate
--
Rick Westerman
[email protected]
Bioinformatics specialist at the Genomics Facility.
Phone: (765) 494-0505 FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building
------------------------------------------------------------------------------
Android apps run on BlackBerry 10
Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
Now with support for Jelly Bean, Bluetooth, Mapview and more.
Get your Android app in front of a whole new audience. Start now.
http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users