It might be useful to make the K-mer storage mechanism pluggable. This would allow a developer to use anything from a simple MultiMap to a NoSQL key-value database to store K-mers. You could plug in custom map implementations to keep a count of how many times each K-mer was found. It might also be useful to be able to do set operations on those K-mer collections. You could use them to determine which K-mers were present in a pathogen and not in a host.

http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026
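
As a rough illustration of the pluggable storage idea, here is a minimal sketch in plain Java. Everything in it is hypothetical: the KmerStore interface, the MapKmerStore implementation, and the index and onlyIn helpers are invented names for this example and are not part of BioJava. Sequences are plain Strings to keep the sketch self-contained. A database-backed store would implement the same interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KmerStoreSketch {

    /** Hypothetical storage interface (not a BioJava type). */
    public interface KmerStore {
        void add(String kmer);   // record one occurrence of a k-mer
        int count(String kmer);  // occurrences seen so far, 0 if never seen
        Set<String> kmers();     // the distinct k-mers in the store
    }

    /** Simple in-memory implementation backed by a HashMap of counts. */
    public static class MapKmerStore implements KmerStore {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void add(String kmer) {
            counts.merge(kmer, 1, Integer::sum);
        }

        @Override
        public int count(String kmer) {
            return counts.getOrDefault(kmer, 0);
        }

        @Override
        public Set<String> kmers() {
            return counts.keySet();
        }
    }

    /** Slides a window of length k one base at a time and records each k-mer. */
    public static void index(String sequence, int k, KmerStore store) {
        for (int i = 0; i + k <= sequence.length(); i++) {
            store.add(sequence.substring(i, i + k));
        }
    }

    /** Set difference: k-mers present in a but absent from b. */
    public static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new HashSet<>(a.kmers());
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new MapKmerStore();
        KmerStore host = new MapKmerStore();
        index("ATGATC", 3, pathogen); // toy sequences, not real data
        index("ATGAAA", 3, host);
        // Prints the 3-mers found only in the "pathogen" toy sequence.
        System.out.println(onlyIn(pathogen, host));
    }
}
```

Because callers only see the interface, swapping the HashMap for another backend should not change the counting or set-difference code. Keeping one store per sequence would give per-sequence counts.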
Cheers,
Mark

card.ly: <http://card.ly/phidias51>

On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <[email protected]> wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as part of core may not be
> necessary, but having it as part of the Genomic module in biojava3 would be
> nice. There is a project, Bioinformatica
> (http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence),
> which does something similar, although not exactly. It counts the k-mers in
> a given fasta file, but it does not count k-mers for each sequence within
> the file, just all within the file. This is a good feature to have,
> especially if one is trying to find patterns within sequences, which is what
> I am trying to do. It would most certainly be helpful to have a k-mer
> counting algorithm that counts k-mer frequency for each sequence. The way to
> go would be to use suffix trees. Again, I don't know if biojava has a suffix
> tree API or not, since I haven't used Java in a while and am just switching
> back to it. A paper on using suffix trees to generate genome-wide k-mer
> frequencies is http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz
> et al.; the software is Tallymer). It would be some work to implement this
> in Java as a module for biojava3, but I can see that it would be helpful.
> Again, for small fasta files it might not be efficient to create a suffix
> tree, but for bigger files I think that might be the way to go.
>
> That's just my two cents. What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>
> > Hi Vishal,
> >
> > As far as I am aware there is nothing which will generate them in BioJava
> > at the moment.
> > However it is possible to do it with BioJava3:
> >
> > import java.util.ArrayList;
> > import java.util.List;
> > import org.biojava3.core.sequence.DNASequence;
> > import org.biojava3.core.sequence.compound.NucleotideCompound;
> > import org.biojava3.core.sequence.template.Sequence;
> > import org.biojava3.core.sequence.template.SequenceView;
> > import org.biojava3.core.sequence.views.WindowedSequence;
> >
> > public static void main(String[] args) {
> >   DNASequence d = new DNASequence("ATGATC");
> >   System.out.println("Non-Overlap");
> >   nonOverlap(d);
> >   System.out.println("Overlap");
> >   overlap(d);
> > }
> >
> > public static final int KMER = 3;
> >
> > //Generate triplets overlapping
> > public static void overlap(Sequence<NucleotideCompound> d) {
> >   List<WindowedSequence<NucleotideCompound>> l =
> >     new ArrayList<WindowedSequence<NucleotideCompound>>();
> >   //One windowed view per reading frame (offsets 1..KMER, 1-based)
> >   for(int i=1; i<=KMER; i++) {
> >     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >       i, d.getLength());
> >     WindowedSequence<NucleotideCompound> w =
> >       new WindowedSequence<NucleotideCompound>(sub, KMER);
> >     l.add(w);
> >   }
> >
> >   //Will return ATG, ATC, TGA & GAT
> >   for(WindowedSequence<NucleotideCompound> w: l) {
> >     for(List<NucleotideCompound> subList: w) {
> >       System.out.println(subList);
> >     }
> >   }
> > }
> >
> > //Generate triplet Compound lists non-overlapping
> > public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >   WindowedSequence<NucleotideCompound> w =
> >     new WindowedSequence<NucleotideCompound>(d, KMER);
> >   //Will return ATG & ATC
> >   for(List<NucleotideCompound> subList: w) {
> >     System.out.println(subList);
> >   }
> > }
> >
> > The disadvantage of all of these solutions is that they generate lists of
> > Compounds, so k-mer generation can/will be a memory-intensive operation.
> > This does not mean it has to be, since sub sequences are thin wrappers
> > around an underlying sequence. Also, the overlap solution is non-optimal
> > since it iterates through each window rather than stepping through,
> > delegating onto each base in turn (hence why we get ATG & ATC before TGA).
> >
> > As for unique k-mers, that's something which would require a bit more
> > engineering & would be better suited to a solution built around a Trie
> > (prefix tree).
> >
> > Hope this helps,
> >
> > Andy
> >
> > On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >
> > > Hi All,
> > >
> > > I had a quick question: Does Biojava have a method to generate k-mers or
> > > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > > counts for every sequence in a fasta file. If something like this exists
> > > it would save me some time to write the code.
> > >
> > > Thanks,
> > >
> > > Vishal
> > > _______________________________________________
> > > Biojava-l mailing list - [email protected]
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > --
> > Andrew Yates                   Ensembl Genomes Engineer
> > EMBL-EBI                       Tel: +44-(0)1223-492538
> > Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> > Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
> --
> *Vishal Thapar, Ph.D.*
> *Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724*
