For reference, Jellyfish is supposed to be the state of the art for fast k-mer counting:
http://www.cbcb.umd.edu/software/jellyfish/
Kasper

On Thu, Feb 7, 2013 at 6:51 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
> Hi Dario,
>
> On 02/05/2013 05:00 PM, Dario Strbenac wrote:
>>
>> Hello,
>>
>> Would it be possible to include an option that firstly goes through all of
>> the strings and runs a sliding window along them, to find all the unique
>> k-mers present in the dataset ?
>
> Finding the unique k-mers in the dataset can easily be done with:
>
>   library(Biostrings)
>
>   uniqueOligonucleotides <- function(x, width)
>   {
>     collapsed_freq <- oligonucleotideFrequency(x, width,
>                                                simplify.as="collapsed")
>     names(collapsed_freq)[which(collapsed_freq != 0L)]
>   }
>
>> This would avoid having a sparse matrix with many columns of all zero
>> counts, when a larger value of width is specified.
>
> Sounds like a useful addition. Maybe we could support this thru
> a 'drop' arg. When 'drop' is TRUE, it would do something like
> this (building on top of uniqueOligonucleotides() and vcountPDict()):
>
>   oligonucleotideFrequency2 <- function(x, width)
>   {
>     kmers <- uniqueOligonucleotides(x, width)
>     pdict <- PDict(kmers)
>     ans <- t(vcountPDict(pdict, x))
>     colnames(ans) <- kmers
>     ans
>   }
>
> Then:
>
>   > library(hgu95av2probe)
>   > probes <- DNAStringSet(hgu95av2probe)
>
>   > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
>   [1] 6 1024
>
>   > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
>   [1] 6 99
>
>   > identical(freq2, freq1[ , colnames(freq2)])
>   [1] TRUE
>
>   > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
>   [1] TRUE
>
> Added to my TODO list.
>
> Thanks,
>
> H.
>
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
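
To make the sliding-window idea from the original request concrete, here is a
minimal base R sketch that needs no packages. It is only an illustration, not
how Biostrings does it: slidingKmers() and observedKmerCounts() are
hypothetical helper names, and the input is assumed to be a plain character
vector rather than a DNAStringSet. For real data, the Biostrings-based
functions quoted above should be much faster.

```r
# Hypothetical helper (not part of Biostrings): all k-mers of one string,
# obtained by sliding a window of size 'width' one position at a time.
slidingKmers <- function(s, width)
{
    n <- nchar(s)
    if (n < width)
        return(character(0))
    starts <- seq_len(n - width + 1L)
    substring(s, starts, starts + width - 1L)
}

# Hypothetical helper: count matrix with one row per string and one column
# per k-mer that occurs at least once somewhere in 'x', so there are no
# all-zero columns.
observedKmerCounts <- function(x, width)
{
    kmers_by_string <- lapply(x, slidingKmers, width=width)
    kmers <- sort(unique(as.character(unlist(kmers_by_string))))
    ans <- matrix(0L, nrow=length(x), ncol=length(kmers),
                  dimnames=list(NULL, kmers))
    for (i in seq_along(x)) {
        # Count how often each observed k-mer appears in string i.
        ans[i, ] <- tabulate(match(kmers_by_string[[i]], kmers),
                             nbins=length(kmers))
    }
    ans
}

seqs <- c("AAAAC", "ACGAC")
counts <- observedKmerCounts(seqs, 2)
```

With these two toy strings and width 2, 'counts' has 4 columns (AA, AC, CG,
GA) instead of the 16 a full dinucleotide table would have.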