Hi Nick,

It depends on the number of translations you're doing. We did some quite large 
performance testing Pjotr Prins for his chapter in Evolutionary Genomics book 
and those changes where a direct result of that *. I think the right solution 
is to revert to using the map already available in the class

Andy

* I wish I still had the stats from before & after this change to show what the 
performance impact was.

On 20 Sep 2013, at 15:57, Nick England <[email protected]> wrote:

> Andy and Andy,
> 
> Does the array really result in that much performance improvement from a
> hashmap? Rather than a huge empty array you could just use a char[][][] for
> the three codons if you want to do it that way. I think the problem is that
> the code for translating "N" as X isn't called if the array lookup returns
> a match. And the array lookup doesn't seem calibrated to cope with Ns (or
> indeed any ambiguous bases.)
> 
> And Andy Law yes, its using the int values (but used to use the numbers
> 1,2,3,4 from the looks of it.)
> 
>       public int compoundToInt(NucleotideCompound c) {
>            char b = c.getUpperedBase().charAt(0);
>            return (int)b;
> //            int v = -1;
> //            if('A' == b) {
> //                v = 1;
> //            }
> //            else if('C' == b) {
> //                v = 2;
> //            }
> //            else if('G' == b) {
> //                v = 3;
> //            }
> //            else if('T' == b || 'U' == b) {
> //                v = 4;
> //            }
> //            return v;
> }
> 
> 
> On 20 September 2013 15:51, LAW Andy <[email protected]> wrote:
> 
>> Looking at the multipliers, I would hazard a guess that the *intent* is to
>> multiply the numbers 0,1,2,3 (ACGT) rather than the ASCII codes. Are you
>> sure the code uses ASCII values?
>> 
>> 
>> On 20 Sep 2013, at 15:16, Nick England <[email protected]> wrote:
>> 
>>> Everyone,
>>> 
>>> I've stepped through with a debugger, and this is a bad bug.
>>> 
>>> The code to translate from RNA->Protein does the following:
>>> - Take the ASCII Value for the 3 RNA bases, and multiple the first pos by
>>> 16, second by 4 and third by 1 and add them up.
>>> - Assume there won't be any collisions.
>>> 
>>> Here are the values which it then uses:
>>> 
>>> A:65
>>> G:71
>>> C:67
>>> U:85
>>> N:78
>>> ANA: 1417
>>> CAU: 1417
>>> ANG: 1423
>>> CGA: 1423
>>> 
>>> Notice any hash collisions?
>>> 
>>> I don't get why this wasn't done in a standard JavaHashMap which would
>>> ensure that any collisions were resolved. This is a pretty critical bug
>> for
>>> a biology informatics package.
>>> 
>>> Nick
>>> 
>>> 
>>> On 20 September 2013 13:45, Nick England <[email protected]> wrote:
>>> 
>>>> Hara,
>>>> 
>>>> Hmm this is rather odd. I get the same issue with that sequence with a
>>>> custom engine as well.
>>>> 
>>>> My code has:
>>>> Builder builder = new TranscriptionEngine.Builder();
>>>>   builder.initMet(false);
>>>>   builder.translateNCodons(true);
>>>>   builder.trimStop(false);
>>>>   TranscriptionEngine engine = builder.build();
>>>>   Sequence<AminoAcidCompound> seq=engine.translate(new
>>>> DNASequence("GTNTGTTAGTGT"));
>>>>   assertEquals("XC*C", seq.toString());
>>>>   Sequence<AminoAcidCompound> seq2=engine.translate(new
>>>> DNASequence("ANAANG"));
>>>>   System.out.println(seq2);
>>>> the first sequence translates as expected, but your sequence is
>>>> translating as HR, when it should be XX. This looks like a pretty bad
>> bug!
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>> On 19 September 2013 19:59, Hara Dilley <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Is there an issue with the DNA Translation in biojava3.core?
>>>>> It appears that it wants to translate "N" in certain cases
>>>>> Executing:
>>>>> new
>>>>> 
>> DNASequence("ANAANG").getRNASequence().getProteinSequence().getSequenceAsString();
>>>>> will produce  aa HR.
>>>>> 
>>>>> thanks
>>>>> Hara
>>>>> 
>>>>> ________________________________
>>>>> 
>>>>> This email and any attachments thereto may contain private,
>> confidential,
>>>>> and privileged material for the sole use of the intended recipient. Any
>>>>> review, copying, or distribution of this email (or any attachments
>> thereto)
>>>>> by others is strictly prohibited. If you are not the intended
>> recipient,
>>>>> please contact the sender immediately and permanently delete the
>> original
>>>>> and any copies of this email and any attachments thereto.
>>>>> 
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  [email protected]
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>> 
>>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  [email protected]
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> Later,
>> 
>> Andy
>> --------
>> Yada, yada, yada...
>> 
>> Disclaimer: This e-mail and any attachments are confidential and intended
>> solely for the use of the recipient(s) to whom they are addressed. If you
>> have received it in error, please destroy all copies and inform the sender.
>> 
>> 
>> 
>> 
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
> _______________________________________________
> Biojava-l mailing list  -  [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l


_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l

Reply via email to