Re: [Biojava-l] issue with translating codons with N

Andy Yates Fri, 20 Sep 2013 09:01:36 -0700

Hi Nick,

It depends on the number of translations you're doing. We did some quite large 
performance testing Pjotr Prins for his chapter in Evolutionary Genomics book 
and those changes where a direct result of that *. I think the right solution 
is to revert to using the map already available in the class


Andy

* I wish I still had the stats from before & after this change to show what the 
performance impact was.

On 20 Sep 2013, at 15:57, Nick England <[email protected]> wrote:

> Andy and Andy,
> 
> Does the array really result in that much performance improvement from a
> hashmap? Rather than a huge empty array you could just use a char[][][] for
> the three codons if you want to do it that way. I think the problem is that
> the code for translating "N" as X isn't called if the array lookup returns
> a match. And the array lookup doesn't seem calibrated to cope with Ns (or
> indeed any ambiguous bases.)
> 
> And Andy Law yes, its using the int values (but used to use the numbers
> 1,2,3,4 from the looks of it.)
> 
>       public int compoundToInt(NucleotideCompound c) {
>            char b = c.getUpperedBase().charAt(0);
>            return (int)b;
> //            int v = -1;
> //            if('A' == b) {
> //                v = 1;
> //            }
> //            else if('C' == b) {
> //                v = 2;
> //            }
> //            else if('G' == b) {
> //                v = 3;
> //            }
> //            else if('T' == b || 'U' == b) {
> //                v = 4;
> //            }
> //            return v;
> }
> 
> 
> On 20 September 2013 15:51, LAW Andy <[email protected]> wrote:
> 
>> Looking at the multipliers, I would hazard a guess that the *intent* is to
>> multiply the numbers 0,1,2,3 (ACGT) rather than the ASCII codes. Are you
>> sure the code uses ASCII values?
>> 
>> 
>> On 20 Sep 2013, at 15:16, Nick England <[email protected]> wrote:
>> 
>>> Everyone,
>>> 
>>> I've stepped through with a debugger, and this is a bad bug.
>>> 
>>> The code to translate from RNA->Protein does the following:
>>> - Take the ASCII Value for the 3 RNA bases, and multiple the first pos by
>>> 16, second by 4 and third by 1 and add them up.
>>> - Assume there won't be any collisions.
>>> 
>>> Here are the values which it then uses:
>>> 
>>> A:65
>>> G:71
>>> C:67
>>> U:85
>>> N:78
>>> ANA: 1417
>>> CAU: 1417
>>> ANG: 1423
>>> CGA: 1423
>>> 
>>> Notice any hash collisions?
>>> 
>>> I don't get why this wasn't done in a standard JavaHashMap which would
>>> ensure that any collisions were resolved. This is a pretty critical bug
>> for
>>> a biology informatics package.
>>> 
>>> Nick
>>> 
>>> 
>>> On 20 September 2013 13:45, Nick England <[email protected]> wrote:
>>> 
>>>> Hara,
>>>> 
>>>> Hmm this is rather odd. I get the same issue with that sequence with a
>>>> custom engine as well.
>>>> 
>>>> My code has:
>>>> Builder builder = new TranscriptionEngine.Builder();
>>>>   builder.initMet(false);
>>>>   builder.translateNCodons(true);
>>>>   builder.trimStop(false);
>>>>   TranscriptionEngine engine = builder.build();
>>>>   Sequence<AminoAcidCompound> seq=engine.translate(new
>>>> DNASequence("GTNTGTTAGTGT"));
>>>>   assertEquals("XC*C", seq.toString());
>>>>   Sequence<AminoAcidCompound> seq2=engine.translate(new
>>>> DNASequence("ANAANG"));
>>>>   System.out.println(seq2);
>>>> the first sequence translates as expected, but your sequence is
>>>> translating as HR, when it should be XX. This looks like a pretty bad
>> bug!
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>> On 19 September 2013 19:59, Hara Dilley <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Is there an issue with the DNA Translation in biojava3.core?
>>>>> It appears that it wants to translate "N" in certain cases
>>>>> Executing:
>>>>> new
>>>>> 
>> DNASequence("ANAANG").getRNASequence().getProteinSequence().getSequenceAsString();
>>>>> will produce  aa HR.
>>>>> 
>>>>> thanks
>>>>> Hara
>>>>> 
>>>>> ________________________________
>>>>> 
>>>>> This email and any attachments thereto may contain private,
>> confidential,
>>>>> and privileged material for the sole use of the intended recipient. Any
>>>>> review, copying, or distribution of this email (or any attachments
>> thereto)
>>>>> by others is strictly prohibited. If you are not the intended
>> recipient,
>>>>> please contact the sender immediately and permanently delete the
>> original
>>>>> and any copies of this email and any attachments thereto.
>>>>> 
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  [email protected]
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>> 
>>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  [email protected]
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> Later,
>> 
>> Andy
>> --------
>> Yada, yada, yada...
>> 
>> Disclaimer: This e-mail and any attachments are confidential and intended
>> solely for the use of the recipient(s) to whom they are addressed. If you
>> have received it in error, please destroy all copies and inform the sender.
>> 
>> 
>> 
>> 
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
> _______________________________________________
> Biojava-l mailing list  -  [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l


_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l

Re: [Biojava-l] issue with translating codons with N

Reply via email to