Hi Nick, It depends on the number of translations you're doing. We did some quite large performance testing Pjotr Prins for his chapter in Evolutionary Genomics book and those changes where a direct result of that *. I think the right solution is to revert to using the map already available in the class
Andy * I wish I still had the stats from before & after this change to show what the performance impact was. On 20 Sep 2013, at 15:57, Nick England <[email protected]> wrote: > Andy and Andy, > > Does the array really result in that much performance improvement from a > hashmap? Rather than a huge empty array you could just use a char[][][] for > the three codons if you want to do it that way. I think the problem is that > the code for translating "N" as X isn't called if the array lookup returns > a match. And the array lookup doesn't seem calibrated to cope with Ns (or > indeed any ambiguous bases.) > > And Andy Law yes, its using the int values (but used to use the numbers > 1,2,3,4 from the looks of it.) > > public int compoundToInt(NucleotideCompound c) { > char b = c.getUpperedBase().charAt(0); > return (int)b; > // int v = -1; > // if('A' == b) { > // v = 1; > // } > // else if('C' == b) { > // v = 2; > // } > // else if('G' == b) { > // v = 3; > // } > // else if('T' == b || 'U' == b) { > // v = 4; > // } > // return v; > } > > > On 20 September 2013 15:51, LAW Andy <[email protected]> wrote: > >> Looking at the multipliers, I would hazard a guess that the *intent* is to >> multiply the numbers 0,1,2,3 (ACGT) rather than the ASCII codes. Are you >> sure the code uses ASCII values? >> >> >> On 20 Sep 2013, at 15:16, Nick England <[email protected]> wrote: >> >>> Everyone, >>> >>> I've stepped through with a debugger, and this is a bad bug. >>> >>> The code to translate from RNA->Protein does the following: >>> - Take the ASCII Value for the 3 RNA bases, and multiple the first pos by >>> 16, second by 4 and third by 1 and add them up. >>> - Assume there won't be any collisions. >>> >>> Here are the values which it then uses: >>> >>> A:65 >>> G:71 >>> C:67 >>> U:85 >>> N:78 >>> ANA: 1417 >>> CAU: 1417 >>> ANG: 1423 >>> CGA: 1423 >>> >>> Notice any hash collisions? >>> >>> I don't get why this wasn't done in a standard JavaHashMap which would >>> ensure that any collisions were resolved. This is a pretty critical bug >> for >>> a biology informatics package. >>> >>> Nick >>> >>> >>> On 20 September 2013 13:45, Nick England <[email protected]> wrote: >>> >>>> Hara, >>>> >>>> Hmm this is rather odd. I get the same issue with that sequence with a >>>> custom engine as well. >>>> >>>> My code has: >>>> Builder builder = new TranscriptionEngine.Builder(); >>>> builder.initMet(false); >>>> builder.translateNCodons(true); >>>> builder.trimStop(false); >>>> TranscriptionEngine engine = builder.build(); >>>> Sequence<AminoAcidCompound> seq=engine.translate(new >>>> DNASequence("GTNTGTTAGTGT")); >>>> assertEquals("XC*C", seq.toString()); >>>> Sequence<AminoAcidCompound> seq2=engine.translate(new >>>> DNASequence("ANAANG")); >>>> System.out.println(seq2); >>>> the first sequence translates as expected, but your sequence is >>>> translating as HR, when it should be XX. This looks like a pretty bad >> bug! >>>> >>>> Nick >>>> >>>> >>>> On 19 September 2013 19:59, Hara Dilley <[email protected]> wrote: >>>> >>>>> Hi, >>>>> >>>>> Is there an issue with the DNA Translation in biojava3.core? >>>>> It appears that it wants to translate "N" in certain cases >>>>> Executing: >>>>> new >>>>> >> DNASequence("ANAANG").getRNASequence().getProteinSequence().getSequenceAsString(); >>>>> will produce aa HR. >>>>> >>>>> thanks >>>>> Hara >>>>> >>>>> ________________________________ >>>>> >>>>> This email and any attachments thereto may contain private, >> confidential, >>>>> and privileged material for the sole use of the intended recipient. Any >>>>> review, copying, or distribution of this email (or any attachments >> thereto) >>>>> by others is strictly prohibited. If you are not the intended >> recipient, >>>>> please contact the sender immediately and permanently delete the >> original >>>>> and any copies of this email and any attachments thereto. >>>>> >>>>> _______________________________________________ >>>>> Biojava-l mailing list - [email protected] >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>> >>>> >>>> >>> _______________________________________________ >>> Biojava-l mailing list - [email protected] >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> Later, >> >> Andy >> -------- >> Yada, yada, yada... >> >> Disclaimer: This e-mail and any attachments are confidential and intended >> solely for the use of the recipient(s) to whom they are addressed. If you >> have received it in error, please destroy all copies and inform the sender. >> >> >> >> >> -- >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. >> >> > _______________________________________________ > Biojava-l mailing list - [email protected] > http://lists.open-bio.org/mailman/listinfo/biojava-l _______________________________________________ Biojava-l mailing list - [email protected] http://lists.open-bio.org/mailman/listinfo/biojava-l
