Re: [Biojava-l] K-mers

Andy Yates Fri, 29 Oct 2010 11:34:33 -0700

Hi Vishal,

There's no suffix tree impl in BioJava but if you want to give it a shot then 
go for it :). I'm interested in how they work but have no time to implement 
one. As for efficiency, give it a shot & let's see what it does.


Andy

On 29 Oct 2010, at 17:27, Vishal Thapar wrote:

> Hi Andy,
> 
> This is good to have. I feel that including it as a part of core may not be 
> necessary but having it as part of Genomic module in biojava3 will be nice. 
> There is a project Bioinformatica 
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>  which does something similar although not exactly. It counts the k-mers in a 
> given fasta file but it does not count k-mers for each sequence within the 
> file, just all within a file. This is a good feature to have, especially if one 
> is trying to find patterns within sequences which is what I am trying to do. 
> It would most certainly be helpful to have a k-mer counting algorithm that 
> counts k-mer frequency for each sequence. The way to go would be to use 
> suffix trees. Again I don't know if biojava has a suffix tree api or not 
> since I haven't used java in a while and am just switching back to it. A 
> paper on using suffix trees to generate genome wide k-mer frequencies is: 
> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al., software 
> is Tallymer). It would be some work to implement this in Java as a module for 
> biojava3 but I can see that this will be helpful. Again, for small fasta 
> files, it might not be efficient to create a suffix tree but for bigger 
> files, I think that might be the way to go.
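
As a point of comparison for the suffix tree idea, per-sequence counting can be sketched with a plain hash map. This is only a minimal sketch in standard Java, not BioJava API: the names FastaKmerCounter and countPerSequence are hypothetical, the FASTA text is assumed to fit in memory, and headers are assumed to be unique within the file.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers
// separately for each record of a FASTA-formatted string.
final class FastaKmerCounter {
    private FastaKmerCounter() {}

    static Map<String, Map<String, Integer>> countPerSequence(String fastaText, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Pass 1: join the residue lines of each record under its header.
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line.toUpperCase(Locale.ROOT));
            }
        }
        // Pass 2: slide a window of width k over each sequence, one base at a time.
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> record : records.entrySet()) {
            String seq = record.getValue().toString();
            Map<String, Integer> counts = new TreeMap<>();
            for (int i = 0; i + k <= seq.length(); i++) {
                counts.merge(seq.substring(i, i + k), 1, Integer::sum);
            }
            result.put(record.getKey(), counts);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countPerSequence(">seq1\nATGATG\n>seq2\nAAAA\n", 3));
    }
}
```

For the two-record input in main with k = 3 this prints {seq1={ATG=2, GAT=1, TGA=1}, seq2={AAA=2}}. Memory grows with the number of distinct k-mers per sequence, which is the cost a suffix-tree-based approach would be compared against.
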
> 
> That's just my two cents. What do you think?
> 
> -vishal
> 
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
> Hi Vishal,
> 
> As far as I am aware there is nothing which will generate them in BioJava at 
> the moment. However it is possible to do it with BioJava3:
> 
> public static void main(String[] args) {
>    DNASequence d = new DNASequence("ATGATC");
>    System.out.println("Non-Overlap");
>    nonOverlap(d);
>    System.out.println("Overlap");
>    overlap(d);
> }
> 
> public static final int KMER = 3;
> 
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>    List<WindowedSequence<NucleotideCompound>> l =
>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>    for(int i=1; i<=KMER; i++) {
>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                i, d.getLength());
>        WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>        l.add(w);
>    }
> 
>    //Will return ATG, ATC, TGA & GAT
>    for(WindowedSequence<NucleotideCompound> w: l) {
>        for(List<NucleotideCompound> subList: w) {
>            System.out.println(subList);
>        }
>    }
> }
> 
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>    WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(d, KMER);
>    //Will return ATG & ATC
>    for(List<NucleotideCompound> subList: w) {
>        System.out.println(subList);
>    }
> }
> 
> The disadvantage of all of these solutions is that they generate lists of 
> Compounds, so k-mer generation can/will be a memory-intensive operation. This 
> does not mean it has to be, since sub sequences are thin wrappers around an 
> underlying sequence. Also the overlap solution is non-optimal since it 
> iterates through each window in turn rather than stepping through the 
> sequence one base at a time (hence why we get ATG & ATC before TGA).
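
The base-by-base stepping mentioned above can be sketched as follows. This is a minimal sketch on plain Java strings rather than BioJava Sequence objects; KmerWindows and overlapping are hypothetical names, and each window is still copied into its own String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: yields overlapping k-mers in
// positional order by advancing the window start one base at a time.
final class KmerWindows {
    private KmerWindows() {}

    static List<String> overlapping(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> kmers = new ArrayList<>();
        // The last valid start is the one whose window ends at the final base.
        for (int start = 0; start + k <= seq.length(); start++) {
            kmers.add(seq.substring(start, start + k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("ATGATC", 3));
    }
}
```

For ATGATC with k = 3 this prints [ATG, TGA, GAT, ATC], i.e. the same four triplets as the overlap example but in the order they occur along the sequence.
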
> 
> As for unique k-mers that's something which would require a bit more 
> engineering & would be better suited to a solution built around a Trie 
> (prefix tree).
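
A Trie for collecting unique k-mers might look roughly like the sketch below. It is written against plain Java strings, not BioJava types, and KmerTrie with its methods is a hypothetical class made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical prefix tree, not part of BioJava: k-mers that share a prefix
// share nodes, and each terminal node records how often its k-mer was added.
final class KmerTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // number of insertions of the k-mer ending at this node
    }

    private final Node root = new Node();

    void add(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length(); i++) {
            node = node.children.computeIfAbsent(kmer.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Inserts every overlapping k-mer of seq.
    void addAll(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= seq.length(); i++) {
            add(seq.substring(i, i + k));
        }
    }

    // Returns how many times kmer was inserted, or 0 if it was never seen.
    int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    // Returns each distinct k-mer once, in lexicographic order.
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> child : node.children.entrySet()) {
            prefix.append(child.getKey());
            collect(child.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }
}
```

After addAll("ATGATG", 3), uniqueKmers() returns [ATG, GAT, TGA] and count("ATG") is 2. A node per character is simple but not compact; a production version would likely pack bases into bits or use an array of four children.
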
> 
> Hope this helps,
> 
> Andy
> 
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> 
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list  -  [email protected]
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> -- 
> Vishal Thapar, Ph.D.
> Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/





