I agree, Andy. These have become standard operations that scientists rely on these days. I am all for implementing them in BioJava3. Java isn't that efficient for this kind of functionality, so it will surely take more effort than the same thing in Python/Perl.

Regards,
Jitesh Dundas

On 10/30/10, Andy Yates <[email protected]> wrote:
> So if it's a suffix tree, that's quite a fixed data structure, so
> developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work, though, then it would all be very well
> received (with the corresponding test environment/cases, of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
>> It might be useful to make the K-mer storage mechanism pluggable. This
>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>> key-value database to store K-mers. You could plug in custom map
>> implementations to keep a count of the number of instances of particular
>> K-mers that were found. It might also be useful to be able to do set
>> operations on those K-mer collections. You could use it to determine
>> which K-mers were present in a pathogen and not in a host.
>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>
>> Cheers,
>>
>> Mark
>>
>> card.ly: <http://card.ly/phidias51>
>>
>>
>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>> <[email protected]> wrote:
>>
>>> Hi Andy,
>>>
>>> This is good to have. I feel that including it as part of core may not
>>> be necessary, but having it as part of the Genomic module in biojava3
>>> would be nice. There is a project, Bioinformatica
>>>
>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>
>>> which does something similar, although not exactly. It counts the k-mers
>>> in a given fasta file, but it does not count k-mers for each sequence
>>> within the file, just all within a file. This is a good feature to have,
>>> especially if one is trying to find patterns within sequences, which is
>>> what I am trying to do.
>>> It would most certainly be helpful to have a k-mer counting algorithm
>>> that counts k-mer frequency for each sequence. The way to go would be to
>>> use suffix trees. Again, I don't know if BioJava has a suffix tree API or
>>> not, since I haven't used Java in a while and am just switching back to
>>> it. A paper on using suffix trees to generate genome-wide k-mer
>>> frequencies is:
>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.;
>>> the software is Tallymer). It would be some work to implement this in
>>> Java as a module for biojava3, but I can see that it would be helpful.
>>> Again, for small fasta files it might not be efficient to create a
>>> suffix tree, but for bigger files I think that might be the way to go.
>>>
>>> That's just my two cents. What do you think?
>>>
>>> -vishal
>>>
>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> As far as I am aware, there is nothing which will generate them in
>>>> BioJava at the moment.
>>>> However, it is possible to do it with BioJava3:
>>>>
>>>> public static void main(String[] args) {
>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>   System.out.println("Non-Overlap");
>>>>   nonOverlap(d);
>>>>   System.out.println("Overlap");
>>>>   overlap(d);
>>>> }
>>>>
>>>> public static final int KMER = 3;
>>>>
>>>> //Generate triplets overlapping
>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>       new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>   //One windowed view per start offset (1..KMER)
>>>>   for(int i=1; i<=KMER; i++) {
>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>         i, d.getLength());
>>>>     WindowedSequence<NucleotideCompound> w =
>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>     l.add(w);
>>>>   }
>>>>
>>>>   //Will return ATG, ATC, TGA & GAT
>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>     for(List<NucleotideCompound> subList: w) {
>>>>       System.out.println(subList);
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> //Generate triplet Compound lists non-overlapping
>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>   WindowedSequence<NucleotideCompound> w =
>>>>       new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>   //Will return ATG & ATC
>>>>   for(List<NucleotideCompound> subList: w) {
>>>>     System.out.println(subList);
>>>>   }
>>>> }
>>>>
>>>> The disadvantage of all of these solutions is that they generate lists
>>>> of Compounds, so k-mer generation can/will be a memory-intensive
>>>> operation. This does not mean it has to be, since sub sequences are
>>>> thin wrappers around an underlying sequence. Also, the overlap solution
>>>> is non-optimal since it iterates through the windows of each start
>>>> offset in turn rather than stepping through the sequence one base at a
>>>> time (which is why we get ATG & ATC before TGA).
>>>>
>>>> As for unique k-mers, that's something which would require a bit more
>>>> engineering & would be better suited to a solution built around a Trie
>>>> (prefix tree).
>>>>
>>>> Hope this helps,
>>>>
>>>> Andy
>>>>
>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>> or K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>> k-mer counts for every sequence in a fasta file. If something like
>>>>> this exists it would save me some time to write the code.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vishal
>>>>> _______________________________________________
>>>>> Biojava-l mailing list - [email protected]
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>> --
>>> *Vishal Thapar, Ph.D.*
>>> *Scientific informatics Analyst
>>> Cold Spring Harbor Lab
>>> Quick Bldg, Lowe Lab
>>> 1 Bungtown Road
>>> Cold Spring Harbor, NY - 11724*
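
To make the ideas in this thread concrete (per-sequence k-mer counts, a pluggable store, and set operations such as "k-mers in the pathogen but not in the host"), here is a minimal sketch in plain Java. It deliberately uses String sequences and java.util maps rather than any BioJava class. The names KmerCountSketch, countKmers, countPerSequence and onlyIn are hypothetical and exist only for illustration. The Supplier argument stands in for the pluggable storage: any Map implementation can be passed in. This is one possible approach under those assumptions, not how BioJava does or would implement it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical illustration only: none of these names come from BioJava.
final class KmerCountSketch {

    // Count overlapping k-mers in one sequence. The caller chooses the
    // backing map through the Supplier (the "pluggable" part).
    static Map<String, Integer> countKmers(
            String seq, int k, Supplier<Map<String, Integer>> store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = store.get();
        // Step one base at a time; stop when fewer than k bases remain.
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Count k-mers separately for every record (id -> sequence),
    // e.g. the records read from a FASTA file.
    static Map<String, Map<String, Integer>> countPerSequence(
            Map<String, String> records, int k,
            Supplier<Map<String, Integer>> store) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : records.entrySet()) {
            result.put(e.getKey(), countKmers(e.getValue(), k, store));
        }
        return result;
    }

    // Set difference: k-mers seen in a but never in b.
    static Set<String> onlyIn(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> diff = new TreeSet<>(a.keySet());
        diff.removeAll(b.keySet());
        return diff;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("seq1", "ATGATC");
        records.put("seq2", "ATGATG");

        Map<String, Map<String, Integer>> perSeq =
                countPerSequence(records, 3, TreeMap::new);
        for (Map.Entry<String, Map<String, Integer>> e : perSeq.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        System.out.println("Only in seq1: "
                + onlyIn(perSeq.get("seq1"), perSeq.get("seq2")));
    }
}
```

Each k-mer here is a new String, so the sketch does nothing about the memory concern raised above. For large genomes, a trie, a suffix-based index, or a 2-bit packed integer key would be the obvious next step.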
