Hi -

While Sequences and SymbolLists offer many advantages over Strings or
character arrays, speed is not one of them.
You can create a Sequence using the SequenceFactory implementations,
which are much more efficient than converting to a String and back to
symbols again; that round trip is a very expensive operation. From
memory, SimpleRichSequence may even have a constructor that takes a
SymbolList and a name. There should be no need to convert to a String
and back.

Also, do you need a Sequence when a SymbolList may contain all the
information you need?

Finally, the Edit operations you use in your wiki example will cause
quite a big performance hit; your comment seems to allude to this. It
would be better to collect all the non-coding points (i), compile them
into a compound location, and then extract the SymbolList for that
location all in one go.

- Mark

On Thu, Apr 24, 2008 at 8:09 PM, Florian Schatz <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I tried that, but it is as slow as a version operating on Strings.
> However, I created a Cookbook entry:
> http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions
>
> Is there a better way to get a Sequence from a SymbolList than:
>
> Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");
>
> Best,
> Florian
>
> On 24.04.2008 at 04:29, Mark Schreiber wrote:
>
> > Hi Florian -
> >
> > There are at least two approaches. You are on the right track with
> > making a union of all gene locations. The compound location that
> > results from the union will contain all the nucleotides that are
> > coding. You can then iterate through each nucleotide in the genome
> > and find out whether the union contains it. If it doesn't, then it
> > is non-coding. This is surprisingly rapid, as the comparisons are
> > simple. The pseudocode would be something like...
> >
> > // initialize this by making a union of all locations of CDS or Gene Features
> > RichLocation coding;
> >
> > RichSequence genome; // read from file or database
> >
> > // you might need to be a bit more sophisticated for a circular genome
> > for (int i = 1; i <= genome.length(); i++) {
> >     if (!coding.contains(i)) {
> >         // you have a non-coding nucleotide.
> >     }
> > }
> >
> > The other approach is to use the blockIterator() method of the
> > compound location that results from the union of coding sequences.
> > This will output each contiguous chunk of coding sequence. If you
> > know the length of the sequence, then you can rapidly figure out the
> > intervening pieces.
> >
> > For example, if the block iterator tells you that [10..50],
> > [90..100], [350..380] are coding and you know the genome is of
> > length 400, then you can quickly derive that [1..9], [51..89],
> > [101..349] and [381..400] are non-coding. Again, it is more
> > complicated for circular sequences, and more complex still if you
> > consider the opposite strand of a gene (the gene shadow) to be
> > non-coding. Unfortunately, there is no convenience method to do
> > this, but if you code something up it would be great to put it in
> > the cookbook so others can re-use it.
> >
> > - Mark
> >
> > You could actually make point locations of all the non-coding
> > nucleotides and then merge the whole lot at the end into a compound
> > location of non-coding.
> >
> > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > I am new to biojava and have worked a lot with it in the last few
> > > weeks. I hope this is the right place for questions; if not,
> > > please tell me.
> > >
> > > I want to get the nucleotide sequence outside the genes of a
> > > GenBank file, so everything that is not marked by a 'gene'
> > > feature. Unfortunately, there is no subtract or exclude function
> > > for the Location class. Any hints?
> > >
> > > Btw: union() of locations worked fine for extracting the
> > > nucleotides of the genes only.
> > >
> > > Best,
> > > Florian
> > > _______________________________________________
> > > Biojava-l mailing list - [email protected]
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
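
The interval arithmetic in Mark's blockIterator() example (coding blocks
[10..50], [90..100], [350..380] on a genome of length 400) can be
sketched as below. This is a minimal illustration in plain Java, not
BioJava API: the class NonCodingBlocks and the method complement() are
hypothetical names, blocks are plain {start, end} pairs rather than
Location objects, and the sequence is assumed to be linear with blocks
sorted by start. In BioJava the blocks would presumably come from the
blockIterator() of the coding union, as described in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives non-coding intervals from sorted coding blocks. */
final class NonCodingBlocks {

    /**
     * Returns the 1-based, inclusive gaps between coding blocks on a linear
     * sequence of the given length. Each block is a {start, end} pair; blocks
     * are assumed to be sorted by start. Overlapping or adjacent blocks are
     * tolerated. Circular sequences and strand handling are not covered.
     */
    static List<int[]> complement(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> gaps = new ArrayList<>();
        int nextUncovered = 1; // first position not yet known to be coding
        for (int[] block : codingBlocks) {
            if (block[0] > nextUncovered) {
                // everything between the previous block and this one is non-coding
                gaps.add(new int[] {nextUncovered, block[0] - 1});
            }
            nextUncovered = Math.max(nextUncovered, block[1] + 1);
        }
        if (nextUncovered <= sequenceLength) {
            // trailing non-coding region after the last coding block
            gaps.add(new int[] {nextUncovered, sequenceLength});
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<int[]> coding = Arrays.asList(
                new int[] {10, 50}, new int[] {90, 100}, new int[] {350, 380});
        for (int[] gap : complement(coding, 400)) {
            System.out.println("[" + gap[0] + ".." + gap[1] + "]");
        }
    }
}
```

For the blocks from the thread, this yields [1..9], [51..89], [101..349]
and [381..400], matching the intervals Mark derives by hand. The
resulting pairs could then be turned into locations and merged into one
compound location, so the SymbolList is extracted in one go rather than
through repeated Edit operations.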
