Be aware that I just found a bug in the code. It has been fixed, but the bug will still be in the alpha3 release. I would recommend either building a version yourself or, if Andreas can post the continuous integration server address, using the release that will be built there tonight.
Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news. Thanks for the directions Andy.
>
> I have already started on this. Let me analyze and write the code now.
>
> Maybe a next month deadline is not unreachable in this case.
>
> Here we go!
> JD
>
> On 10/30/10, Andy Yates <[email protected]> wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers
>> are also not unique across the sequence but it's a start :)
>>
>> Share & enjoy!
>>
>> Andy
>>
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>
>>> I agree Andy. These have become standard functionalities that
>>> scientists do these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionalities so we will surely
>>> need more effort compared to the same in Python/Perl.
>>>
>>> Regards,
>>> Jitesh Dundas
>>>
>>> On 10/30/10, Andy Yates <[email protected]> wrote:
>>>> So if it's a suffix tree that's quite a fixed data structure, so
>>>> developing a pluggable mechanism there would be hard. I think there
>>>> also has to be a limit as to what we can sensibly do. If people want
>>>> to contribute this kind of work though then it'll all be very well
>>>> received (with the corresponding test environment/cases of course).
>>>>
>>>> Cheers,
>>>>
>>>> Andy
>>>>
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>
>>>>> It might be useful to make the K-mer storage mechanism pluggable.
>>>>> This would allow a developer to use anything from a simple MultiMap
>>>>> to a NoSQL key-value database to store K-mers.
>>>>> You could plug in custom map implementations to allow you to keep a
>>>>> count of the number of instances of particular K-mers that were
>>>>> found. It might also be useful to be able to do set operations on
>>>>> those K-mer collections. You could use it to determine which K-mers
>>>>> were present in a pathogen and not in a host.
>>>>>
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mark
>>>>>
>>>>> card.ly: <http://card.ly/phidias51>
>>>>>
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <[email protected]> wrote:
>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> This is good to have. I feel that including it as a part of core
>>>>>> may not be necessary but having it as part of Genomic module in
>>>>>> biojava3 will be nice. There is a project Bioinformatica
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>> which does something similar although not exactly. It counts the
>>>>>> k-mers in a given fasta file but it does not count k-mers for each
>>>>>> sequence within the file, just all within a file. This is a good
>>>>>> feature to have, especially if one is trying to find patterns
>>>>>> within sequences, which is what I am trying to do. It would most
>>>>>> certainly be helpful to have a k-mer counting algorithm that counts
>>>>>> k-mer frequency for each sequence. The way to go would be to use
>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api
>>>>>> or not since I haven't used java in a while and am just switching
>>>>>> back to it. A paper on using suffix trees to generate genome wide
>>>>>> k-mer frequencies is:
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.,
>>>>>> software is tallymer).
>>>>>> It would be some work to implement this in java as a module for
>>>>>> biojava3 but I can see that this will be helpful. Again, for small
>>>>>> fasta files, it might not be efficient to create a suffix tree but
>>>>>> for bigger files, I think that might be the way to go.
>>>>>>
>>>>>> That's just my two cents. What do you think?
>>>>>>
>>>>>> -vishal
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>> BioJava at the moment. However it is possible to do it with
>>>>>>> BioJava3:
>>>>>>>
>>>>>>>   public static void main(String[] args) {
>>>>>>>     DNASequence d = new DNASequence("ATGATC");
>>>>>>>     System.out.println("Non-Overlap");
>>>>>>>     nonOverlap(d);
>>>>>>>     System.out.println("Overlap");
>>>>>>>     overlap(d);
>>>>>>>   }
>>>>>>>
>>>>>>>   public static final int KMER = 3;
>>>>>>>
>>>>>>>   //Generate triplets overlapping
>>>>>>>   public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>     List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>       new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>     for(int i=1; i<=KMER; i++) {
>>>>>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>         i, d.getLength());
>>>>>>>       WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>       l.add(w);
>>>>>>>     }
>>>>>>>
>>>>>>>     //Will return ATG, ATC, TGA & GAT
>>>>>>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>       for(List<NucleotideCompound> subList: w) {
>>>>>>>         System.out.println(subList);
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>   //Generate triplet Compound lists non-overlapping
>>>>>>>   public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>       new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>     //Will return ATG & ATC
>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>       System.out.println(subList);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> The disadvantage of all of these solutions is that they generate
>>>>>>> lists of Compounds so kmer generation can/will be a memory
>>>>>>> intensive operation. This doesn't mean it has to be, since sub
>>>>>>> sequences are thin wrappers around an underlying sequence. Also
>>>>>>> the overlap solution is non-optimal since it iterates through each
>>>>>>> window rather than stepping through, delegating onto each base in
>>>>>>> turn (hence why we get ATG & ATC before TGA).
>>>>>>>
>>>>>>> As for unique k-mers that's something which would require a bit
>>>>>>> more engineering & would be better suited to a solution built
>>>>>>> around a Trie (prefix tree).
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I had a quick question: Does Biojava have a method to generate
>>>>>>>> k-mers or K-mer counting in a given Fasta Sequence / File?
>>>>>>>> Basically, I want k-mer counts for every sequence in a fasta
>>>>>>>> file. If something like this exists it would save me some time
>>>>>>>> writing the code.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Vishal
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list - [email protected]
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>
>>>>>> --
>>>>>> *Vishal Thapar, Ph.D.*
>>>>>> *Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724*
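
For anyone following the thread, the per-sequence counting and the "present in a pathogen and not in a host" set operation discussed above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: it works on plain Strings rather than BioJava3 sequences, it is not the code in SequenceMixin, and the class KmerSketch and its method names are hypothetical names made up for this example. It steps through the sequence one base at a time, so overlapping k-mers come out in positional order.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of BioJava.
final class KmerSketch {

    private KmerSketch() { }

    // Counts overlapping k-mers by advancing one base at a time.
    static Map<String, Integer> countOverlapping(String sequence, int k) {
        return count(sequence, k, 1);
    }

    // Counts non-overlapping k-mers by advancing k bases at a time.
    static Map<String, Integer> countNonOverlapping(String sequence, int k) {
        return count(sequence, k, k);
    }

    // Shared loop: a trailing fragment shorter than k is ignored.
    private static Map<String, Integer> count(String sequence, int k, int step) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // LinkedHashMap keeps k-mers in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= sequence.length(); i += step) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Set difference: k-mers seen in the first map but not in the second.
    static Set<String> onlyInFirst(Map<String, Integer> first,
                                   Map<String, Integer> second) {
        Set<String> result = new LinkedHashSet<>(first.keySet());
        result.removeAll(second.keySet());
        return result;
    }
}
```

Calling KmerSketch.countOverlapping("ATGATC", 3) gives the k-mers ATG, TGA, GAT and ATC, each with a count of 1, and countNonOverlapping gives ATG and ATC. Any Map implementation could be swapped in for the LinkedHashMap, which is the simplest form of the pluggable storage idea; substring copies characters, so this does not address the memory concern raised for large sequences.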
