So we've got some basic k-mer work in SVN now. If you look in the class SequenceMixin there are two static methods for generating the two types of k-mers (overlapping & non-overlapping). It's not developed with Map storage in mind & I'll leave the door open there for anyone else to come in & develop it. The k-mers are also not unique across the sequence, so repeated k-mers come back once per occurrence, but it's a start :)
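For anyone who wants to see the idea without the BioJava types, here is a minimal sketch on plain Strings. It is not the SequenceMixin code: the class name KmerSketch and its three methods are made up for illustration, and the count() step (a plain Map from k-mer to number of occurrences) is just one assumed way to collapse the non-unique k-mers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of BioJava.
final class KmerSketch {

    // Overlapping k-mers: slide the window along one base at a time.
    static List<String> overlapping(String seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            kmers.add(seq.substring(i, i + k));
        }
        return kmers;
    }

    // Non-overlapping k-mers: step k bases at a time.
    // A trailing window shorter than k is dropped.
    static List<String> nonOverlapping(String seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i += k) {
            kmers.add(seq.substring(i, i + k));
        }
        return kmers;
    }

    // Collapse a k-mer list into unique k-mers with their counts.
    // LinkedHashMap keeps the order of first appearance.
    static Map<String, Integer> count(List<String> kmers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String kmer : kmers) {
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("ATGATC", 3));        // [ATG, TGA, GAT, ATC]
        System.out.println(nonOverlapping("ATGATC", 3));     // [ATG, ATC]
        System.out.println(count(overlapping("ATGATG", 3))); // {ATG=2, TGA=1, GAT=1}
    }
}
```

Swapping the LinkedHashMap for another Map implementation is the obvious place to experiment with different storage.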
Share & enjoy!

Andy

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree Andy. These have become standard functionalities that
> scientists do these days. I am all for implementing that in BioJava3.
> Java isn't that efficient for such functionalities so we will surely
> need more effort compared to the same in Python/Perl.
>
> Regards,
> Jitesh Dundas
>
> On 10/30/10, Andy Yates <[email protected]> wrote:
>> So if it's a suffix tree that's quite a fixed data structure so
>> developing a pluggable mechanism there would be hard. I think there also
>> has to be a limit as to what we can sensibly do. If people want to
>> contribute this kind of work though then it would all be very well received
>> (with the corresponding test environment/cases of course).
>>
>> Cheers,
>>
>> Andy
>>
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>
>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>>> key-value database to store K-mers. You could plug in custom map
>>> implementations to allow you to keep a count of the number of instances of
>>> particular K-mers that were found. It might also be useful to be able to do
>>> set operations on those K-mer collections. You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>> card.ly: <http://card.ly/phidias51>
>>>
>>>
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>> <[email protected]> wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> This is good to have. I feel that including it as a part of core may not be
>>>> necessary but having it as part of Genomic module in biojava3 will be nice.
>>>> There is a project Bioinformatica
>>>>
>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>
>>>> which does something similar although not exactly. It counts the k-mers in a
>>>> given fasta file but it does not count k-mers for each sequence within the
>>>> file, just all within a file. This is a good feature to have, especially if
>>>> one is trying to find patterns within sequences, which is what I am trying to
>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>> that counts k-mer frequency for each sequence. The way to go would be to use
>>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>>> since I haven't used java in a while and am just switching back to it. A
>>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.,
>>>> software is tallymer). It would be some work to implement this in java as a
>>>> module for biojava3 but I can see that this will be helpful. Again, for small
>>>> fasta files, it might not be efficient to create a suffix tree but for bigger
>>>> files, I think that might be the way to go.
>>>>
>>>> That's just my two cents. What do you think?
>>>>
>>>> -vishal
>>>>
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> As far as I am aware there is nothing which will generate them in
>>>>> BioJava at the moment.
>>>>> However it is possible to do it with BioJava3:
>>>>>
>>>>>   public static void main(String[] args) {
>>>>>     DNASequence d = new DNASequence("ATGATC");
>>>>>     System.out.println("Non-Overlap");
>>>>>     nonOverlap(d);
>>>>>     System.out.println("Overlap");
>>>>>     overlap(d);
>>>>>   }
>>>>>
>>>>>   public static final int KMER = 3;
>>>>>
>>>>>   //Generate triplets overlapping
>>>>>   public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>     List<WindowedSequence<NucleotideCompound>> l =
>>>>>       new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>     //One windowed view per start offset (1-based) up to KMER
>>>>>     for(int i=1; i<=KMER; i++) {
>>>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>         i, d.getLength());
>>>>>       WindowedSequence<NucleotideCompound> w =
>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>       l.add(w);
>>>>>     }
>>>>>
>>>>>     //Will return ATG, ATC, TGA & GAT
>>>>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>       for(List<NucleotideCompound> subList: w) {
>>>>>         System.out.println(subList);
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>
>>>>>   //Generate triplet Compound lists non-overlapping
>>>>>   public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>       new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>     //Will return ATG & ATC
>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>       System.out.println(subList);
>>>>>     }
>>>>>   }
>>>>>
>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>> Compounds so k-mer generation can/will be a memory intensive operation. This
>>>>> doesn't mean it has to be since sub sequences are thin wrappers around an
>>>>> underlying sequence.
>>>>> Also the overlap solution is non-optimal since it
>>>>> iterates through each window rather than stepping through delegating onto
>>>>> each base in turn (hence why we get ATG & ATC before TGA).
>>>>>
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>>>> counts for every sequence in a fasta file. If something like this exists it
>>>>>> would save me some time to write the code.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vishal
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list - [email protected]
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>> --
>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
