One of the disadvantages of the Sequence-based system is that it has no support for searching sequences with patterns such as regular expressions. It is possible to convert a Sequence into a String and then run the expression against that, but it is a sub-optimal solution.
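As a rough illustration of avoiding the String copy, the sketch below wraps a sequence in a read-only CharSequence view and hands that to java.util.regex. The IndexedBases interface is a hypothetical stand-in for a BioJava Sequence (it assumes 1-based positions and one character per compound); it is not a BioJava API, and none of the class or method names here come from BioJava.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceRegexSketch {

    /** Hypothetical stand-in for a sequence: 1-based positions, one char per compound. */
    interface IndexedBases {
        int getLength();
        char getBaseAt(int position); // position runs from 1 to getLength()
    }

    /** Read-only CharSequence view over IndexedBases; no characters are copied. */
    static final class BasesCharSequence implements CharSequence {
        private final IndexedBases bases;
        private final int offset; // 0-based start within the underlying sequence
        private final int length;

        BasesCharSequence(IndexedBases bases) {
            this(bases, 0, bases.getLength());
        }

        private BasesCharSequence(IndexedBases bases, int offset, int length) {
            this.bases = bases;
            this.offset = offset;
            this.length = length;
        }

        @Override public int length() { return length; }

        @Override public char charAt(int index) {
            if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
            return bases.getBaseAt(offset + index + 1); // CharSequence is 0-based
        }

        @Override public CharSequence subSequence(int start, int end) {
            if (start < 0 || end > length || start > end) throw new IndexOutOfBoundsException();
            return new BasesCharSequence(bases, offset + start, end - start);
        }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) sb.append(charAt(i));
            return sb.toString();
        }
    }

    /** Returns the 1-based start position of every non-overlapping match of regex. */
    static List<Integer> findAll(String regex, IndexedBases bases) {
        Matcher m = Pattern.compile(regex).matcher(new BasesCharSequence(bases));
        List<Integer> starts = new ArrayList<Integer>();
        while (m.find()) starts.add(m.start() + 1);
        return starts;
    }

    /** Toy IndexedBases backed by a String, for demonstration only. */
    static IndexedBases of(final String s) {
        return new IndexedBases() {
            public int getLength() { return s.length(); }
            public char getBaseAt(int position) { return s.charAt(position - 1); }
        };
    }

    public static void main(String[] args) {
        System.out.println(findAll("ATG|ATC", of("ATGATC"))); // prints [1, 4]
    }
}
```

Note that Matcher.find() reports non-overlapping matches only, so overlapping motifs would need a lookahead pattern or a manual restart at start + 1.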
Looking at the Pattern code in Java 6, it accepts a CharSequence, so one could write an adaptor that makes a Sequence act as a CharSequence for the matching procedure, but it looks like a lot of work. As for a way of matching against sequences, HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
>
> Is there any way to detect patterns in the recorded k-mers?
>
> I have a large set of miRNAs (a study of mutations and patterns for
> gastric cancer). I made a record of k-mers for each sequence, but the
> patterns that are generated are difficult to track.
>
> Can BioJava do this? Regular expressions in Java may be useful here.
>
> I would appreciate expert advice on this, or on any other software that
> might be useful.
>
> Thanks,
> Jitesh Dundas
>
> On 10/29/10, jitesh dundas <[email protected]> wrote:
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too.
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to BioJava 3?
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates <[email protected]> wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment.
>>> However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation. This
>>> does not mean it has to be, since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA).
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
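
For the counting part of the question, here is a plain-Java sketch that steps through the sequence one base at a time, so the k-mers come out in positional order (ATG, TGA, GAT, ATC), and tallies them in a map. It is only an illustration under a few assumptions: it works on a String rather than a BioJava Sequence, a LinkedHashMap is swapped in for the Trie purely to keep it short, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KmerCountSketch {

    /** Counts every overlapping k-mer in seq, keyed in order of first appearance. */
    static Map<String, Integer> countKmers(String seq, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        // Slide a window of width k along the sequence, one base per step.
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            Integer seen = counts.get(kmer);
            counts.put(kmer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same toy sequence as in the BioJava3 example
        System.out.println(countKmers("ATGATC", 3)); // prints {ATG=1, TGA=1, GAT=1, ATC=1}
    }
}
```

Each k-mer is materialised as a String here, so memory still grows with the number of distinct k-mers; a Trie would share common prefixes instead.
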
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list - [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
