So if it's a suffix tree, that's quite a fixed data structure, so developing a pluggable mechanism there would be hard. I think there also has to be a limit to what we can sensibly do. If people want to contribute this kind of work, though, it would all be very well received (with the corresponding test environment/cases, of course).
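
For a store that is not tied to a suffix tree, a pluggable mechanism could look roughly like the sketch below. This is only an illustration under stated assumptions: the names KmerStore, HashMapKmerStore and KmerSketch are hypothetical and not part of BioJava, and sequences are plain Strings so the example stays self-contained. It counts overlapping k-mers into whichever store is plugged in, and shows the "in the pathogen but not in the host" set operation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical interface (not a BioJava type): anything that can count k-mers.
interface KmerStore {
    void add(String kmer);    // record one occurrence of a k-mer
    int count(String kmer);   // occurrences seen so far, 0 if never seen
    Set<String> kmers();      // copy of the distinct k-mers stored
}

// Simplest possible backing store: an in-memory map of counts.
// A MultiMap or key-value database could implement the same interface.
class HashMapKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void add(String kmer) {
        Integer c = counts.get(kmer);
        counts.put(kmer, c == null ? 1 : c + 1);
    }

    public int count(String kmer) {
        Integer c = counts.get(kmer);
        return c == null ? 0 : c;
    }

    public Set<String> kmers() {
        return new HashSet<String>(counts.keySet());
    }
}

class KmerSketch {
    // Slide a window of length k along the sequence one base at a time,
    // so the k-mers overlap. Sequences shorter than k add nothing.
    static void countKmers(String sequence, int k, KmerStore store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= sequence.length(); i++) {
            store.add(sequence.substring(i, i + k));
        }
    }

    // K-mers present in a but absent from b (e.g. pathogen minus host).
    static Set<String> difference(KmerStore a, KmerStore b) {
        Set<String> result = a.kmers();
        result.removeAll(b.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new HashMapKmerStore();
        KmerStore host = new HashMapKmerStore();
        countKmers("ATGATG", 3, pathogen);
        countKmers("ATGA", 3, host);
        System.out.println("ATG in pathogen: " + pathogen.count("ATG"));
        System.out.println("Pathogen only: " + difference(pathogen, host));
    }
}
```

Keeping one store per sequence would give per-sequence counts; sharing one store across a file would give whole-file counts. Because the counting loop only sees the interface, swapping the backing store does not touch it.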

Cheers,
Andy

On 29 Oct 2010, at 17:56, Mark Fortner wrote:

> It might be useful to make the K-mer storage mechanism pluggable. This
> would allow a developer to use anything from a simple MultiMap to a NoSQL
> key-value database to store K-mers. You could plug in custom map
> implementations to allow you to keep a count of the number of instances of
> particular K-mers that were found. It might also be useful to be able to do
> set operations on those K-mer collections. You could use it to determine
> which K-mers were present in a pathogen and not in a host.
>
> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>
> Cheers,
>
> Mark
>
> card.ly: <http://card.ly/phidias51>
>
> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <[email protected]> wrote:
>
>> Hi Andy,
>>
>> This is good to have. I feel that including it as a part of core may not
>> be necessary but having it as part of Genomic module in biojava3 will be
>> nice. There is a project Bioinformatica
>>
>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>
>> which does something similar although not exactly. It counts the k-mers
>> in a given fasta file but it does not count k-mers for each sequence
>> within the file, just all within a file. This is a good feature to have,
>> especially if one is trying to find patterns within sequences, which is
>> what I am trying to do. It would most certainly be helpful to have a k-mer
>> counting algorithm that counts k-mer frequency for each sequence. The way
>> to go would be to use suffix trees. Again I don't know if biojava has a
>> suffix tree api or not since I haven't used java in a while and am just
>> switching back to it. A paper on using suffix trees to generate genome
>> wide k-mer frequencies is:
>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al,
>> software is tallymer). It would be some work to implement this in java as
>> a module for biojava3 but I can see that this will be helpful. Again, for
>> small fasta files, it might not be efficient to create a suffix tree but
>> for bigger files, I think that might be the way to go.
>>
>> That's just my two cents. What do you think?
>>
>> -vishal
>>
>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in
>>> BioJava at the moment. However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists
>>> of Compounds so kmer generation can/will be a memory intensive
>>> operation. This does not mean it has to be, since sub sequences are thin
>>> wrappers around an underlying sequence. Also the overlap solution is
>>> non-optimal since it iterates through each window rather than stepping
>>> through delegating onto each base in turn (hence why we get ATG & ATC
>>> before TGA).
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>> or K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>> k-mer counts for every sequence in a fasta file. If something like this
>>>> exists it would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list - [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>> --
>> *Vishal Thapar, Ph.D.*
>> *Scientific informatics Analyst
>> Cold Spring Harbor Lab
>> Quick Bldg, Lowe Lab
>> 1 Bungtown Road
>> Cold Spring Harbor, NY - 11724*

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
