That is good news. Thanks for the directions, Andy. I have already started on this. Let me analyze and write the code now.
Maybe a deadline next month is not out of reach in this case. Here we go!

JD

On 10/30/10, Andy Yates wrote:
> So we've got some basic kmer work now in SVN. If you look in the class
> SequenceMixin there are two static methods there for generating the two
> types of k-mers. It's not developed with Map storage in mind & I'll leave
> the door open there for anyone else to come in & develop it. The k-mers are
> also not unique across the sequence but it's a start :)
>
> Share & enjoy!
>
> Andy
>
> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>
>> I agree Andy. These have become standard functionalities that
>> scientists do these days. I am all for implementing that in BioJava3.
>> Java isn't that efficient for such functionalities so we will surely
>> need more effort compared to the same in Python/Perl.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/30/10, Andy Yates wrote:
>>> So if it's a suffix tree that's quite a fixed data structure so the
>>> chances of developing a pluggable mechanism there would be hard. I
>>> think there also has to be a limit as to what we can sensibly do. If
>>> people want to contribute this kind of work though then it'd all be
>>> very well received (with the corresponding test environment/cases of
>>> course).
>>>
>>> Cheers,
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>
>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>> NoSQL key-value database to store K-mers. You could plug in custom map
>>>> implementations to allow you to keep a count of the number of
>>>> instances of particular K-mers that were found. It might also be
>>>> useful to be able to do set operations on those K-mer collections. You
>>>> could use it to determine which K-mers were present in a pathogen and
>>>> not in a host.
>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>
>>>> Cheers,
>>>>
>>>> Mark
>>>>
>>>> card.ly: <http://card.ly/phidias51>
>>>>
>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
>>>>
>>>>> Hi Andy,
>>>>>
>>>>> This is good to have. I feel that including it as a part of core may
>>>>> not be necessary but having it as part of Genomic module in biojava3
>>>>> will be nice. There is a project Bioinformatica
>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>> which does something similar although not exactly. It counts the
>>>>> k-mers in a given fasta file but it does not count k-mers for each
>>>>> sequence within the file, just all within a file. This is a good
>>>>> feature to have, especially if one is trying to find patterns within
>>>>> sequences, which is what I am trying to do. It would most certainly be
>>>>> helpful to have a k-mer counting algorithm that counts k-mer
>>>>> frequency for each sequence. The way to go would be to use suffix
>>>>> trees. Again I don't know if biojava has a suffix tree api or not
>>>>> since I haven't used java in a while and am just switching back to
>>>>> it. A paper on using suffix trees to generate genome wide k-mer
>>>>> frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract
>>>>> (kurtz et al, software is tallymer). It would be some work to
>>>>> implement this in java as a module for biojava3 but I can see that
>>>>> this will be helpful. Again, for small fasta files, it might not be
>>>>> efficient to create a suffix tree but for bigger files, I think that
>>>>> might be the way to go.
>>>>>
>>>>> That's just my two cents. What do you think?
>>>>>
>>>>> -vishal
>>>>>
>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>> BioJava at the moment. However it is possible to do it with BioJava3:
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>>>   System.out.println("Non-Overlap");
>>>>>>   nonOverlap(d);
>>>>>>   System.out.println("Overlap");
>>>>>>   overlap(d);
>>>>>> }
>>>>>>
>>>>>> public static final int KMER = 3;
>>>>>>
>>>>>> //Generate triplets overlapping
>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>   for(int i=1; i<=KMER; i++) {
>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>       i, d.getLength());
>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>     l.add(w);
>>>>>>   }
>>>>>>
>>>>>>   //Will return ATG, ATC, TGA & GAT
>>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>       System.out.println(subList);
>>>>>>     }
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>   WindowedSequence<NucleotideCompound> w =
>>>>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>   //Will return ATG & ATC
>>>>>>   for(List<NucleotideCompound> subList: w) {
>>>>>>     System.out.println(subList);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> The disadvantage of all of these solutions is that they generate
>>>>>> lists of Compounds so kmer generation can/will be a memory intensive
>>>>>> operation. This does not mean it has to be, since sub sequences are
>>>>>> thin wrappers around an underlying sequence.
>>>>>> Also the overlap solution is non-optimal since it iterates through
>>>>>> each window rather than stepping through delegating onto each base
>>>>>> in turn (hence why we get ATG & ATC before TGA)
>>>>>>
>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>> engineering & would be better suited to a solution built around a
>>>>>> Trie (prefix tree).
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I had a quick question: Does Biojava have a method to generate
>>>>>>> k-mers or K-mer counting in a given Fasta Sequence / File?
>>>>>>> Basically, I want k-mer counts for every sequence in a fasta file.
>>>>>>> If something like this exists it would save me some time to write
>>>>>>> the code.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vishal
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>> --
>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>
>>>>> --
>>>>> *Vishal Thapar, Ph.D.*
>>>>> *Scientific informatics Analyst
>>>>> Cold Spring Harbor Lab
>>>>> Quick Bldg, Lowe Lab
>>>>> 1 Bungtown Road
>>>>> Cold Spring Harbor, NY - 11724*
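
For the per-sequence counting asked about at the start of the thread, and the
"present in a pathogen and not in a host" set operation suggested along the
way, a minimal sketch on plain strings might look like the one below. It
deliberately does not use the BioJava3 API: the class KmerCounts and its
methods countKmers, countPerSequence and onlyInFirst are hypothetical names
made up for illustration, and the HashMap storage is just one of the options
discussed above (a MultiMap, key-value store or Trie could be swapped in). It
steps through the sequence one base at a time, so overlapping k-mers come out
in sequence order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of BioJava: counts k-mers on plain strings. */
final class KmerCounts {

    /** Counts every overlapping k-mer in one sequence, sliding one base at a time. */
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        // A sequence shorter than k yields no k-mers, so the loop body never runs.
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts k-mers separately for each named sequence (e.g. each FASTA record). */
    static Map<String, Map<String, Integer>> countPerSequence(
            Map<String, String> sequences, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sequences.entrySet()) {
            result.put(e.getKey(), countKmers(e.getValue(), k));
        }
        return result;
    }

    /** Returns the k-mers seen in the first sequence but not in the second. */
    static Set<String> onlyInFirst(String first, String second, int k) {
        Set<String> kmers = new HashSet<>(countKmers(first, k).keySet());
        kmers.removeAll(countKmers(second, k).keySet());
        return kmers;
    }

    public static void main(String[] args) {
        // Record names and sequences here are made-up inputs for the demo.
        Map<String, String> records = new LinkedHashMap<>();
        records.put("seq1", "ATGATC");
        records.put("seq2", "ATGATG");
        System.out.println(countPerSequence(records, 3));
        System.out.println(onlyInFirst("ATGATC", "ATGATG", 3));
    }
}
```

Holding every k-mer as its own String is memory hungry for large inputs, which
is the same trade-off raised above about lists of Compounds; for genome-scale
files a suffix tree or Trie would likely be the better fit.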
