I got your point, Andy. Thanks.

On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates <[email protected]> wrote:
> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or, if Andreas can post up the continuous
> integration server address, there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
> > That is good news. Thanks for the directions, Andy.
> >
> > I have already started on this. Let me analyze and write the code now.
> >
> > Maybe a next-month deadline is not unreachable in this case.
> >
> > Here we go!
> > JD
> >
> > On 10/30/10, Andy Yates <[email protected]> wrote:
> >> So we've got some basic kmer work now in SVN. If you look in the class
> >> SequenceMixin there are two static methods there for generating the two
> >> types of k-mers. It's not developed with Map storage in mind & I'll leave
> >> the door open there for anyone else to come in & develop it. The k-mers are
> >> also not unique across the sequence but it's a start :)
> >>
> >> Share & enjoy!
> >>
> >> Andy
> >>
> >> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
> >>
> >>> I agree, Andy. These have become standard functionalities that
> >>> scientists use these days. I am all for implementing that in BioJava3.
> >>> Java isn't that efficient for such functionalities so we will surely
> >>> need more effort compared to the same in Python/Perl.
> >>>
> >>> Regards,
> >>> Jitesh Dundas
> >>>
> >>> On 10/30/10, Andy Yates <[email protected]> wrote:
> >>>> So if it's a suffix tree, that's quite a fixed data structure, so
> >>>> developing a pluggable mechanism there would be hard. I think there also
> >>>> has to be a limit as to what we can sensibly do. If people want to
> >>>> contribute this kind of work though then it'll all be very well received
> >>>> (with the corresponding test environment/cases of course).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
> >>>>
> >>>>> It might be useful to make the K-mer storage mechanism pluggable. This
> >>>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
> >>>>> key-value database to store K-mers. You could plug in custom map
> >>>>> implementations to allow you to keep a count of the number of instances of
> >>>>> particular K-mers that were found. It might also be useful to be able to do
> >>>>> set operations on those K-mer collections. You could use it to determine
> >>>>> which K-mers were present in a pathogen and not in a host.
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> card.ly: <http://card.ly/phidias51>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> Hi Andy,
> >>>>>>
> >>>>>> This is good to have. I feel that including it as a part of core may not be
> >>>>>> necessary but having it as part of Genomic module in biojava3 will be nice.
> >>>>>> There is a project Bioinformatica
> >>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
> >>>>>> which does something similar although not exactly. It counts the k-mers in a
> >>>>>> given fasta file but it does not count k-mers for each sequence within the
> >>>>>> file, just all within a file. This is a good feature to have, especially if
> >>>>>> one is trying to find patterns within sequences, which is what I am trying to
> >>>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
> >>>>>> that counts k-mer frequency for each sequence. The way to go would be to use
> >>>>>> suffix trees.
> >>>>>> Again, I don't know if biojava has a suffix tree api or not
> >>>>>> since I haven't used java in a while and am just switching back to it. A
> >>>>>> paper on using suffix trees to generate genome-wide k-mer frequencies is:
> >>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.,
> >>>>>> software is Tallymer). It would be some work to implement this in java as a
> >>>>>> module for biojava3 but I can see that this will be helpful. Again, for small
> >>>>>> fasta files, it might not be efficient to create a suffix tree but for bigger
> >>>>>> files, I think that might be the way to go.
> >>>>>>
> >>>>>> That's just my two cents. What do you think?
> >>>>>>
> >>>>>> -vishal
> >>>>>>
> >>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Vishal,
> >>>>>>>
> >>>>>>> As far as I am aware there is nothing which will generate them in BioJava
> >>>>>>> at the moment.
> >>>>>>> However it is possible to do it with BioJava3:
> >>>>>>>
> >>>>>>> public static void main(String[] args) {
> >>>>>>>   DNASequence d = new DNASequence("ATGATC");
> >>>>>>>   System.out.println("Non-Overlap");
> >>>>>>>   nonOverlap(d);
> >>>>>>>   System.out.println("Overlap");
> >>>>>>>   overlap(d);
> >>>>>>> }
> >>>>>>>
> >>>>>>> public static final int KMER = 3;
> >>>>>>>
> >>>>>>> //Generate triplets overlapping
> >>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>>>>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>>>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>>>>>   for(int i=1; i<=KMER; i++) {
> >>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>>>>>       i, d.getLength());
> >>>>>>>     WindowedSequence<NucleotideCompound> w =
> >>>>>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>>>>>     l.add(w);
> >>>>>>>   }
> >>>>>>>
> >>>>>>>   //Will return ATG, ATC, TGA & GAT
> >>>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>>>>>     for(List<NucleotideCompound> subList: w) {
> >>>>>>>       System.out.println(subList);
> >>>>>>>     }
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Generate triplet Compound lists non-overlapping
> >>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>>>>>   WindowedSequence<NucleotideCompound> w =
> >>>>>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>>>>>   //Will return ATG & ATC
> >>>>>>>   for(List<NucleotideCompound> subList: w) {
> >>>>>>>     System.out.println(subList);
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The disadvantage of all of these solutions is that they generate lists of
> >>>>>>> Compounds so kmer generation can/will be a memory intensive operation. This
> >>>>>>> doesn't mean it has to be since sub sequences are thin wrappers around an
> >>>>>>> underlying sequence.
> >>>>>>> Also the overlap solution is non-optimal since it
> >>>>>>> iterates through each window rather than stepping through delegating onto
> >>>>>>> each base in turn (hence why we get ATG & ATC before TGA).
> >>>>>>>
> >>>>>>> As for unique k-mers that's something which would require a bit more
> >>>>>>> engineering & would be better suited to a solution built around a Trie
> >>>>>>> (prefix tree).
> >>>>>>>
> >>>>>>> Hope this helps,
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
> >>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> >>>>>>>> counts for every sequence in a fasta file. If something like this exists it
> >>>>>>>> would save me some time to write the code.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Vishal
> >>>>>>>> _______________________________________________
> >>>>>>>> Biojava-l mailing list - [email protected]
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>
> >>>>>>> --
> >>>>>>> Andrew Yates                   Ensembl Genomes Engineer
> >>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *Vishal Thapar, Ph.D.*
> >>>>>> *Scientific informatics Analyst
> >>>>>> Cold Spring Harbor Lab
> >>>>>> Quick Bldg, Lowe Lab
> >>>>>> 1 Bungtown Road
> >>>>>> Cold Spring Harbor, NY - 11724*
> >>>>>>
> >>>>> _______________________________________________
> >>>>> Biojava-l mailing list - [email protected]
> >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
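
A minimal sketch of two ideas raised in this thread: counting k-mers for a single sequence, and finding the k-mers present in one sequence (say, a pathogen) but not in another (a host). It is plain Java and makes no BioJava calls. The class KmerCounts and its methods countKmers and onlyInFirst are hypothetical names chosen for illustration, and sequences are assumed to be plain strings rather than BioJava Sequence objects:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of BioJava: k-mer counting on plain strings.
final class KmerCounts {

    // Count every overlapping k-mer in one sequence, stepping one base at a time.
    // Sequences shorter than k yield an empty map.
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            // merge() inserts 1 for a new k-mer or adds 1 to the existing count
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Set difference: k-mers found in the first sequence but absent from the second.
    static Set<String> onlyInFirst(String first, String second, int k) {
        Set<String> result = new TreeSet<>(countKmers(first, k).keySet());
        result.removeAll(countKmers(second, k).keySet());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countKmers("ATGATC", 3));
        System.out.println(onlyInFirst("ATGATC", "TGATCC", 3));
    }
}
```

For per-sequence counts over a FASTA file, countKmers would be called once per record. Because the result is an ordinary Map, a different Map implementation could be swapped in, which is roughly the pluggable storage suggested above.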
