Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too. Vishal, I think Andy's suggestions may be a good option to include in BioJava 3. Would you like to add this to BioJava 3?
Thanks again.

Regards,
Jitesh Dundas

On 10/29/10, Andy Yates <[email protected]> wrote:
> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava at
> the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>   DNASequence d = new DNASequence("ATGATC");
>   System.out.println("Non-Overlap");
>   nonOverlap(d);
>   System.out.println("Overlap");
>   overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>   List<WindowedSequence<NucleotideCompound>> l =
>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>   for(int i=1; i<=KMER; i++) {
>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>       i, d.getLength());
>     WindowedSequence<NucleotideCompound> w =
>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>     l.add(w);
>   }
>
>   //Will return ATG, ATC, TGA & GAT
>   for(WindowedSequence<NucleotideCompound> w: l) {
>     for(List<NucleotideCompound> subList: w) {
>       System.out.println(subList);
>     }
>   }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>   WindowedSequence<NucleotideCompound> w =
>     new WindowedSequence<NucleotideCompound>(d, KMER);
>   //Will return ATG & ATC
>   for(List<NucleotideCompound> subList: w) {
>     System.out.println(subList);
>   }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does not mean it has to be, since sub sequences are thin wrappers around an
> underlying sequence.
> Also the overlap solution is non-optimal since it
> iterates through each window in turn rather than stepping through the
> sequence one base at a time (hence why we get ATG & ATC before TGA).
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
>> Hi All,
>>
>> I had a quick question: Does Biojava have a method to generate k-mers or
>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>> counts for every sequence in a fasta file. If something like this exists
>> it would save me some time to write the code.
>>
>> Thanks,
>>
>> Vishal
>> _______________________________________________
>> Biojava-l mailing list - [email protected]
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
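
For the k-mer counts asked about in the original question, a minimal plain-Java sketch may help. It steps through the sequence one base at a time, as suggested above, and tallies each window in a map. This is not BioJava API: the class name KmerCounter and the method countKmers are hypothetical, and it assumes the sequence is available as a plain String (Java 8 or later for Map.merge).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class KmerCounter {

    // Count every overlapping k-mer by sliding a window one base at a time.
    // Assumes the sequence is a plain String; a trailing fragment shorter
    // than k is ignored.
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // LinkedHashMap keeps k-mers in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {ATG=1, TGA=1, GAT=1, ATC=1}
        System.out.println(countKmers("ATGATC", 3));
    }
}
```

Note that this stores one String per distinct k-mer, so for large k or long sequences a Trie, as mentioned above, may be the more memory-friendly choice.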
