Re: [Biojava-l] K-mers

Andy Yates Fri, 29 Oct 2010 12:36:32 -0700

So we've got some basic k-mer work in SVN now. If you look in the class
SequenceMixin there are two static methods for generating the two types of
k-mers (overlapping and non-overlapping). It's not developed with Map storage
in mind & I'll leave the door open for anyone else to come in & develop it.
The k-mers are also not unique across the sequence, but it's a start :)


Share & enjoy!

Andy
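
The Map storage left open above could look roughly like the following. This is a minimal sketch on plain Java strings, not on BioJava sequences; the names KmerCounts and countKmers are hypothetical and are not the SequenceMixin methods mentioned in the message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers in a string.
class KmerCounts {

    // Returns each distinct k-mer mapped to its number of occurrences,
    // in order of first appearance.
    static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            // merge() inserts 1 for a new k-mer or adds 1 to the existing count
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {ATG=2, TGA=1, GAT=1}
        System.out.println(countKmers("ATGATG", 3));
    }
}
```

The key set of the returned map is the set of unique k-mers, so the same structure covers both uniqueness and counting for one sequence; calling it once per FASTA record would give per-sequence counts.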

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree, Andy. These have become standard operations that scientists
> perform these days. I am all for implementing them in BioJava3. Java isn't
> that efficient for this kind of functionality, so we will surely need more
> effort than the same would take in Python/Perl.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/30/10, Andy Yates <[email protected]> wrote:
>> So if it's a suffix tree, that's quite a fixed data structure, so developing
>> a pluggable mechanism there would be hard. I think there also has to be a
>> limit as to what we can sensibly do. If people want to contribute this kind
>> of work, though, then it would all be very well received (with the
>> corresponding test environment/cases of course).
>> 
>> Cheers,
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>> 
>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>>> key-value database to store K-mers.  You could plug in custom map
>>> implementations to keep a count of the number of instances of particular
>>> K-mers that were found.  It might also be useful to be able to do set
>>> operations on those K-mer collections.  You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
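
The pluggable storage and set operations suggested above can be sketched as follows. This assumes plain strings and the standard java.util.Set interface as the plug-in point; KmerSets, kmers and onlyIn are hypothetical names, and the two sequences are made-up examples, not real pathogen or host data.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical sketch, not BioJava API: k-mer sets with caller-chosen storage.
class KmerSets {

    // Collects the distinct overlapping k-mers of seq into whatever Set
    // implementation the caller supplies (the "pluggable" part).
    static Set<String> kmers(String seq, int k, Supplier<Set<String>> storage) {
        Set<String> out = storage.get();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    // Set difference: k-mers present in first but absent from second.
    static Set<String> onlyIn(Set<String> first, Set<String> second) {
        Set<String> diff = new TreeSet<>(first);
        diff.removeAll(second);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> pathogen = kmers("ATGATC", 3, TreeSet::new); // ATC, ATG, GAT, TGA
        Set<String> host = kmers("ATGAAA", 3, TreeSet::new);     // AAA, ATG, GAA, TGA
        // Prints [ATC, GAT]
        System.out.println(onlyIn(pathogen, host));
    }
}
```

Any Set implementation can be passed as the supplier, so an in-memory set could later be swapped for one backed by an external store without changing the calling code.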
>>> 
>>> Cheers,
>>> 
>>> Mark
>>> 
>>> card.ly: <http://card.ly/phidias51>
>>> 
>>> 
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <[email protected]> wrote:
>>> 
>>>> Hi Andy,
>>>> 
>>>> This is good to have. I feel that including it as part of core may not be
>>>> necessary, but having it as part of the Genomic module in biojava3 would
>>>> be nice. There is a project, Bioinformatica
>>>> (http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence),
>>>> which does something similar, although not exactly. It counts the k-mers
>>>> in a given fasta file, but only across the whole file, not for each
>>>> sequence within it. Per-sequence counts are a good feature to have,
>>>> especially if one is trying to find patterns within sequences, which is
>>>> what I am trying to do. It would most certainly be helpful to have a k-mer
>>>> counting algorithm that counts k-mer frequency for each sequence. The way
>>>> to go would be to use suffix trees. Again, I don't know whether biojava
>>>> has a suffix tree API, since I haven't used java in a while and am just
>>>> switching back to it. A paper on using suffix trees to generate
>>>> genome-wide k-mer frequencies is
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>>>> software is Tallymer). It would be some work to implement this in java as
>>>> a module for biojava3, but I can see that it would be helpful. Again, for
>>>> small fasta files it might not be efficient to create a suffix tree, but
>>>> for bigger files I think that might be the way to go.
>>>> 
>>>> That's just my two cents. What do you think?
>>>> 
>>>> -vishal
>>>> 
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>>> 
>>>>> Hi Vishal,
>>>>> 
>>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>>> at the moment. However, it is possible to do it with BioJava3:
>>>>> 
>>>>> public static void main(String[] args) {
>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>  System.out.println("Non-Overlap");
>>>>>  nonOverlap(d);
>>>>>  System.out.println("Overlap");
>>>>>  overlap(d);
>>>>> }
>>>>> 
>>>>> public static final int KMER = 3;
>>>>> 
>>>>> //Generate triplets overlapping
>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>              i, d.getLength());
>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>      l.add(w);
>>>>>  }
>>>>> 
>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>          System.out.println(subList);
>>>>>      }
>>>>>  }
>>>>> }
>>>>> 
>>>>> //Generate triplet Compound lists non-overlapping
>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>  //Will return ATG & ATC
>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>      System.out.println(subList);
>>>>>  }
>>>>> }
>>>>> 
>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>> Compounds, so kmer generation can/will be a memory intensive operation.
>>>>> This does not mean it has to be, since sub sequences are thin wrappers
>>>>> around an underlying sequence. Also, the overlap solution is non-optimal
>>>>> since it iterates through each window in turn rather than stepping
>>>>> through the sequence and delegating onto each base in turn (which is why
>>>>> we get ATG & ATC before TGA).
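
The alternative described above, stepping through the sequence one base at a time, can be sketched as a lazy iterator. This is an illustration on plain strings under the assumption that k-mers are wanted in positional order; SlidingKmers and overlapping are hypothetical names and do not use the BioJava WindowedSequence class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch, not BioJava API: lazy single-pass overlapping k-mers.
class SlidingKmers {

    // Yields the k-mer starting at position 0, then 1, then 2, and so on.
    // Each k-mer is only materialised when next() is called.
    static Iterable<String> overlapping(final String seq, final int k) {
        return () -> new Iterator<String>() {
            private int start = 0;

            @Override
            public boolean hasNext() {
                return start + k <= seq.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String kmer = seq.substring(start, start + k);
                start++;
                return kmer;
            }
        };
    }

    public static void main(String[] args) {
        // Prints ATG, TGA, GAT, ATC on separate lines (positional order)
        for (String kmer : overlapping("ATGATC", 3)) {
            System.out.println(kmer);
        }
    }
}
```

Unlike the three-window approach in the quoted code, this visits the k-mers in the order they occur along the sequence and holds only one of them at a time.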
>>>>> 
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
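
A Trie-based store for unique k-mers, as suggested above, might look like the sketch below. It is an assumption-laden illustration on plain strings with k of at least 1; KmerTrie and its methods are hypothetical and are not part of BioJava.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not BioJava API: a prefix tree holding unique k-mers.
class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // occurrences of the k-mer ending at this node
    }

    private final Node root = new Node();
    private final int k;

    KmerTrie(int k) {
        this.k = k;
    }

    // Inserts every overlapping k-mer of seq; shared prefixes share nodes.
    void addAll(String seq) {
        for (int i = 0; i + k <= seq.length(); i++) {
            Node node = root;
            for (int j = i; j < i + k; j++) {
                node = node.children.computeIfAbsent(seq.charAt(j), c -> new Node());
            }
            node.count++;
        }
    }

    // Number of times kmer was inserted, or 0 if it was never seen.
    int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    // All distinct k-mers, in lexicographic order.
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (prefix.length() == k) {
            out.add(prefix.toString());
            return;
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            char c = e.getKey();
            prefix.append(c);
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addAll("ATGATG");
        System.out.println(trie.uniqueKmers()); // [ATG, GAT, TGA]
        System.out.println(trie.count("ATG"));  // 2
    }
}
```

Each distinct k-mer is stored once however often it occurs, and k-mers with a common prefix share the nodes for that prefix.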
>>>>> 
>>>>> Hope this helps,
>>>>> 
>>>>> Andy
>>>>> 
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I had a quick question: does Biojava have a method to generate k-mers or
>>>>>> count k-mers in a given Fasta sequence/file? Basically, I want k-mer
>>>>>> counts for every sequence in a fasta file. If something like this exists
>>>>>> it would save me some time writing the code.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Vishal
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  [email protected]
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>>> --
>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*
>>>> 
>> 
>> 

