I was thinking more along the lines of using something that implements the Map interface. This would allow a developer to easily unit test the code without having to load the data for a genome. You would also be able to provide different implementations to suit your needs. If you wanted to use a suffix tree as the underlying implementation, that would be OK, but you would have other options as well.
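A minimal sketch of what that could look like, assuming a hypothetical KmerCounter class (not an existing BioJava API) that works on plain Strings and counts overlapping k-mers into whatever Map the caller supplies:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical helper, not part of BioJava: counts overlapping k-mers
 * into a caller-supplied Map, so the storage can be swapped out.
 */
final class KmerCounter {

    private final int k;
    private final Map<String, Integer> store; // pluggable backing store

    KmerCounter(int k, Map<String, Integer> store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.store = store;
    }

    /** Adds every overlapping k-mer of the given sequence to the store. */
    void add(String sequence) {
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            Integer seen = store.get(kmer);
            store.put(kmer, seen == null ? 1 : seen + 1);
        }
    }

    /** Returns the count recorded so far, or 0 if the k-mer was never added. */
    int count(String kmer) {
        Integer seen = store.get(kmer);
        return seen == null ? 0 : seen;
    }

    public static void main(String[] args) {
        // A sorted in-memory map here; a unit test could pass a HashMap instead.
        KmerCounter counter = new KmerCounter(3, new TreeMap<String, Integer>());
        counter.add("ATGATC");
        System.out.println(counter.store); // prints {ATC=1, ATG=1, GAT=1, TGA=1}
    }
}
```

Because the counter only depends on the Map interface, a unit test can hand it a HashMap and a six-base string, while a production run could hand it a Map backed by a key-value database or a tree structure without changing the counting code.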
Cheers,

Mark

card.ly: <http://card.ly/phidias51>

On Fri, Oct 29, 2010 at 11:35 AM, Andy Yates <[email protected]> wrote:

> So if it's a suffix tree that's quite a fixed data structure, so
> developing a pluggable mechanism there would be hard. I think there
> also has to be a limit as to what we can sensibly do. If people want
> to contribute this kind of work though then it would all be very well
> received (with the corresponding test environment/cases of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
> > It might be useful to make the K-mer storage mechanism pluggable.
> > This would allow a developer to use anything from a simple MultiMap
> > to a NoSQL key-value database to store K-mers. You could plug in
> > custom map implementations to allow you to keep a count of the
> > number of instances of particular K-mers that were found. It might
> > also be useful to be able to do set operations on those K-mer
> > collections. You could use it to determine which K-mers were present
> > in a pathogen and not in a host.
> >
> > http://www.ncbi.nlm.nih.gov/pubmed/20428334
> > http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >
> > Cheers,
> >
> > Mark
> >
> > card.ly: <http://card.ly/phidias51>
> >
> > On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <[email protected]> wrote:
> >
> >> Hi Andy,
> >>
> >> This is good to have. I feel that including it as a part of core
> >> may not be necessary but having it as part of Genomic module in
> >> biojava3 will be nice. There is a project Bioinformatica
> >>
> >> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
> >>
> >> which does something similar although not exactly. It counts the
> >> k-mers in a given fasta file but it does not count k-mers for each
> >> sequence within the file, just all within a file. This is a good
> >> feature to have, especially if one is trying to find patterns
> >> within sequences, which is what I am trying to do. It would most
> >> certainly be helpful to have a k-mer counting algorithm that counts
> >> k-mer frequency for each sequence. The way to go would be to use
> >> suffix trees. Again I don't know if biojava has a suffix tree api
> >> or not since I haven't used java in a while and am just switching
> >> back to it. A paper on using suffix trees to generate genome wide
> >> k-mer frequencies is:
> >> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >> software is tallymer). It would be some work to implement this in
> >> java as a module for biojava3 but I can see that this will be
> >> helpful. Again, for small fasta files, it might not be efficient to
> >> create a suffix tree but for bigger files, I think that might be
> >> the way to go.
> >>
> >> That's just my two cents. What do you think?
> >>
> >> -vishal
> >>
> >> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As far as I am aware there is nothing which will generate them in
> >>> BioJava at the moment. However it is possible to do it with
> >>> BioJava3:
> >>>
> >>> public static void main(String[] args) {
> >>>   DNASequence d = new DNASequence("ATGATC");
> >>>   System.out.println("Non-Overlap");
> >>>   nonOverlap(d);
> >>>   System.out.println("Overlap");
> >>>   overlap(d);
> >>> }
> >>>
> >>> public static final int KMER = 3;
> >>>
> >>> //Generate triplets overlapping
> >>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>   for(int i=1; i<=KMER; i++) {
> >>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>       i, d.getLength());
> >>>     WindowedSequence<NucleotideCompound> w =
> >>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>     l.add(w);
> >>>   }
> >>>
> >>>   //Will return ATG, ATC, TGA & GAT
> >>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>     for(List<NucleotideCompound> subList: w) {
> >>>       System.out.println(subList);
> >>>     }
> >>>   }
> >>> }
> >>>
> >>> //Generate triplet Compound lists non-overlapping
> >>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>   WindowedSequence<NucleotideCompound> w =
> >>>     new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>   //Will return ATG & ATC
> >>>   for(List<NucleotideCompound> subList: w) {
> >>>     System.out.println(subList);
> >>>   }
> >>> }
> >>>
> >>> The disadvantage of all of these solutions is that they generate
> >>> lists of Compounds so kmer generation can/will be a memory
> >>> intensive operation. This does not mean it has to be, since sub
> >>> sequences are thin wrappers around an underlying sequence. Also
> >>> the overlap solution is non-optimal since it iterates through
> >>> each window rather than stepping through delegating onto each
> >>> base in turn (hence why we get ATG & ATC before TGA).
> >>>
> >>> As for unique k-mers that's something which would require a bit
> >>> more engineering & would be better suited to a solution built
> >>> around a Trie (prefix tree).
> >>>
> >>> Hope this helps,
> >>>
> >>> Andy
> >>>
> >>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I had a quick question: Does Biojava have a method to generate
> >>>> k-mers or K-mer counting in a given Fasta Sequence / File?
> >>>> Basically, I want k-mer counts for every sequence in a fasta
> >>>> file. If something like this exists it would save me some time
> >>>> to write the code.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Vishal
> >>>
> >>> --
> >>> Andrew Yates                   Ensembl Genomes Engineer
> >>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>
> >> --
> >> *Vishal Thapar, Ph.D.*
> >> *Scientific informatics Analyst
> >> Cold Spring Harbor Lab
> >> Quick Bldg, Lowe Lab
> >> 1 Bungtown Road
> >> Cold Spring Harbor, NY - 11724*
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
