One of the disadvantages of the Sequence-based system is that we have no 
support for searching sequences with patterns such as regular expressions. 
Whilst it is possible to convert a Sequence into a String and then run the 
expression against that, it is a sub-optimal solution.
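
For what it's worth, here is a minimal sketch of the String route. It takes a plain String standing in for the converted Sequence (I believe Sequence exposes getSequenceAsString() for this, but treat that as an assumption and check your BioJava version). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexOnSequenceString {

    // Returns the 1-based start position of every non-overlapping match.
    // 'bases' stands in for the String form of a Sequence (assumption).
    public static List<Integer> findMotif(String bases, String regex) {
        List<Integer> starts = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(bases);
        while (m.find()) {
            starts.add(m.start() + 1); // Matcher is 0-based, sequences are 1-based
        }
        return starts;
    }

    public static void main(String[] args) {
        // AT followed by either G or C
        System.out.println(findMotif("ATGATC", "AT[GC]")); // prints [1, 4]
    }
}
```

The cost is that the whole sequence is copied into a String before matching, which is why it is only a stopgap for large sequences.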

Looking at the Pattern code in Java 6, it accepts a CharSequence, so one 
could write an adaptor that makes a Sequence act as a CharSequence for the 
matching procedure, but it really looks like a lot of work.
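
A rough sketch of what such an adaptor might look like. To keep it self-contained it does not use BioJava at all: BaseSource is a hypothetical stand-in for the two things the adaptor would need from a Sequence (its length and the base at a 1-based position), and mapping a Compound to a char is left out entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceCharAdapter implements CharSequence {

    // Hypothetical stand-in for a Sequence; positions are 1-based.
    public interface BaseSource {
        int getLength();
        char baseAt(int position);
    }

    private final BaseSource source;
    private final int offset; // 0-based start of this view within the source
    private final int length;

    public SequenceCharAdapter(BaseSource source) {
        this(source, 0, source.getLength());
    }

    private SequenceCharAdapter(BaseSource source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    public int length() {
        return length;
    }

    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return source.baseAt(offset + index + 1); // shift to 1-based
    }

    // A view over the same source, so no bases are copied here.
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new SequenceCharAdapter(source, offset + start, end - start);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String bases = "ATGATC";
        BaseSource demo = new BaseSource() {
            public int getLength() { return bases.length(); }
            public char baseAt(int position) { return bases.charAt(position - 1); }
        };
        Matcher m = Pattern.compile("TGA").matcher(new SequenceCharAdapter(demo));
        while (m.find()) {
            System.out.println(m.group() + " at " + (m.start() + 1)); // TGA at 2
        }
    }
}
```

The matcher only ever calls length(), charAt() and subSequence(), so the sequence is never turned into one big String; only the matched groups are.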

As for a way of matching against sequences, HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
> 
> Is there any way to detect patterns in the recorded k-mers?
> 
> I have a large set of miRNAs (a study of mutations and patterns in
> gastric cancer). I made a record of k-mers for each sequence, but the
> patterns that are generated are difficult to track.
> 
> Can BioJava do this? Regular expressions in Java may be useful here.
> 
> I would appreciate expert advice on this, and on any other software that
> might be useful.
> 
> Thanks,
> Jitesh Dundas
> 
> On 10/29/10, jitesh dundas <[email protected]> wrote:
>> Dear Friends,
>> 
>> Thanks to Vishal & Andy for this. I actually needed this code too.
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to BioJava 3?
>> 
>> Thanks again.
>> 
>> Regards,
>> Jitesh Dundas
>> 
>> On 10/29/10, Andy Yates <[email protected]> wrote:
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing in BioJava which will generate them
>>> at the moment. However, it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>> 
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds, so k-mer generation can/will be a memory-intensive operation.
>>> This does not mean it has to be, since sub sequences are thin wrappers
>>> around an underlying sequence. Also, the overlap solution is non-optimal
>>> since it iterates through each window in turn rather than stepping through
>>> the sequence one base at a time (hence why we get ATG & ATC before TGA).
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
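
To make the trie idea quoted above a bit more concrete, here is a minimal sketch of counting unique k-mers with a prefix tree. It works on a plain String rather than a BioJava Sequence, and KmerTrie and its methods are names invented for this example, not anything that exists in BioJava.

```java
import java.util.Map;
import java.util.TreeMap;

public class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<Character, Node>();
        int count; // how many inserted k-mers end at this node
    }

    private final Node root = new Node();

    // Walk down from the root, creating nodes as needed, then bump the counter.
    public void add(CharSequence kmer) {
        Node n = root;
        for (int i = 0; i < kmer.length(); i++) {
            char c = kmer.charAt(i);
            Node next = n.children.get(c);
            if (next == null) {
                next = new Node();
                n.children.put(c, next);
            }
            n = next;
        }
        n.count++;
    }

    // Insert every overlapping window of length k, stepping one base at a time.
    public static KmerTrie of(String bases, int k) {
        KmerTrie trie = new KmerTrie();
        for (int i = 0; i + k <= bases.length(); i++) {
            trie.add(bases.substring(i, i + k));
        }
        return trie;
    }

    // Unique k-mers and their counts, in alphabetical order.
    public Map<String, Integer> counts() {
        Map<String, Integer> out = new TreeMap<String, Integer>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node n, StringBuilder prefix, Map<String, Integer> out) {
        if (n.count > 0) {
            out.put(prefix.toString(), n.count);
        }
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            prefix.append(e.getKey().charValue());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        System.out.println(KmerTrie.of("ATGATGA", 3).counts()); // {ATG=2, GAT=1, TGA=2}
    }
}
```

K-mers that share a prefix share nodes, so repeated k-mers cost one counter increment rather than a new list of Compounds each time.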
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: does BioJava have a method to generate k-mers or
>>>> count k-mers in a given FASTA sequence / file? Basically, I want k-mer
>>>> counts for every sequence in a FASTA file. If something like this exists,
>>>> it would save me some time writing the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/




