to directly to that Rick said: This also means that in a heavily skewed dataset (e.g. insufficient rRNA removal, had that recently), a lower k-mer will produce more and longer contigs than a higher k-mer value.
Unrelated: Sometimes it's a matter of trial and error to see what fits. In my very last dataset, a k-mer of 31 gave good results, 77 bad results, 125 again good results. Don't ask me why exactly. ...and also another thing to consider: Define "good". More and longer contigs? Yeah, that can happen with a low k-mer size...because contigs are more easily joined incorrectly (unless I'm totally wrong, but I've seen that too). ________________________________________ From: Rick Westerman [wester...@purdue.edu] Sent: Wednesday, February 12, 2014 10:42 PM To: Nathaniel Jue Cc: denovoassembler-users@lists.sourceforge.net Subject: Re: [Denovoassembler-users] Decreasing assembly quality with increasing kmer values A peak in effective kmer size for a quality assembly is common in almost all assembly programs. I see it in ABySS, in Ray, in SOAPdenovo. This is not to say that you are doing everything correctly -- you may be making an error since you are suspiciously peaking out at your memory constraints -- but rather it is to say that longer kmers simply may not be useful. In my current project Ray's best assembly is at kmer41. It is worse at lower kmers and higher kmers. There is no hard-and-fast rule. The above is the important point. Below is a long winded explanation which I am sure that other people could do much better than I. I recall reading in the Ray manual that longer kmers are simply too divergent to be useful but can not find that passage at the moment. Let's see if I can explain -- if nothing else the practice is good for me. Basically for any given kmer the assembly programs are looking for a peak number of reads versus occurrence. This is roughly the depth of coverage and will be used for further refinement. As an example, for kmer41 I might get: kmers occurring 1 time: 20 [basically noise; in the ideal world] kmers occurring 2 times: 30 kmers occurring 3 times: 20 ... kmers occurring 20 times: 300 kmers occurring 21 times: 2,500 kmers occurring 22 times: 9,000 kmers occurring 23 times: 7,500 kmers occurring 24 times: 1,500 ... Thus my peak is about 22 times. For a genome assembly program it is assumed that the entire genome will be at this coverage. The assembly program can then throw away -- or treat differently -- the parts of reads that have kmers in the low occurrence category (these are due to sequencing error or random noise) as well as those with high occurrence (these are repeat areas). In the above example the assembly program might discards reads with kmers outside the '10 to 40' range. The above concept works well if there are no sequencing errors. But of course there are thus kmers with errors become unique and we see the initial part of the occurrence list look more like : kmers occurring 1 time: 10,000 kmers occurring 2 times: 3,000 kmers occurring 3 times: 500 ... As we increase the kmer length then the number of kmers with an error in them, and thus unique, increases. More and more kmers get put into the 'occurring 1 time' (or 2 times) category and are thus discarded or at least not used in the initial stages of assembly. Additionally the peak gets flattened and harder to determine. Thus we want want to balance kmer size with the need not to have too short of a kmer. As an example, let's say that sequencing error is 1 in 100 bases (with some distribution function). If read sizes are 100 bases and the kmer size is 31 then There will be 69 kmers from this reads. What is the chance that any given kmer will have an error in it and thus be unique? It does depend on the distribution but we can say (very!) roughly about 25% (if I have that right). How about if the kmer size is 90? Then roughly 90% of the kmers will have an error and thus be unique. At this point we might as well not try to assemble anything. Bottom line: longer kmers are not always better. ----- Original Message ----- > Just wondering if you might be able to offer some insight as to what > might be going with our runs of ray. We are assembly a large > eukaryotic genome and have been using both Ray and Celera to do so. > Initially, Ray gave us some great results and increasing kmer values > up to 31-kmer readily improved our assemblies. We then ran into a > memory wall and got some folks to help us out running things off site. > We were able to run a 33-kmer run on the same set, but it looked worse > than the 31-kmer run. Additionally, longer runs of say 61-kmers have > looked even worse. Genome should be around 30-40x coverage, I think, > with Hi-seq and Mi-seq data (2x150 and 2x250 paired-end libraries, > each heavily trimmed to get rid of anything suspicious) and we thought > we'd still have some room to increase kmer sizes >31 to improve the > assembly. Any thoughts on what might be going on here? Did we possibly > compile the program incorrectly for larger kmer sizes? Ever see > something similar? > > Thanks in advance for any thoughts on the matter. > > Cheers, > Nate -- Rick Westerman wester...@purdue.edu Bioinformatics specialist at the Genomics Facility. Phone: (765) 494-0505 FAX: (765) 496-7255 Department of Horticulture and Landscape Architecture 625 Agriculture Mall Drive West Lafayette, IN 47907-2010 Physically located in room S049, WSLR building ------------------------------------------------------------------------------ Android apps run on BlackBerry 10 Introducing the new BlackBerry 10.2.1 Runtime for Android apps. Now with support for Jelly Bean, Bluetooth, Mapview and more. Get your Android app in front of a whole new audience. Start now. http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk _______________________________________________ Denovoassembler-users mailing list Denovoassembler-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/denovoassembler-users ------------------------------------------------------------------------------ Android apps run on BlackBerry 10 Introducing the new BlackBerry 10.2.1 Runtime for Android apps. Now with support for Jelly Bean, Bluetooth, Mapview and more. Get your Android app in front of a whole new audience. Start now. http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk _______________________________________________ Denovoassembler-users mailing list Denovoassembler-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/denovoassembler-users