[pygr] seq.translation() (was: blast issues to decide)

C. Titus Brown Thu, 16 Jul 2009 19:56:34 -0700

On Thu, Jul 16, 2009 at 11:50:02AM -0700, Christopher Lee wrote:
-> On Jul 15, 2009, at 10:37 PM, C. Titus Brown wrote:
-> > One of the things I wanted to do with my code was to provide a
-> > 'translation' on sequences that would let you translate a DNA sequence
-> > into a protein sequence in any specified frame.  What do you think?
-> 
-> If I'm understanding your question right, TranslationDB does this  
-> pretty directly.  Say you have a nucleotide sequence database object  
-> db.  Then you can get the translation of any desired frame or subslice  
-> of a sequence in db as easily as:
-> 
-> tdb = TranslationDB(db)
-> # 100 AA beginning at nucleotide i on positive strand
-> orf100 = tdb[seqID][i:i+300]
-> # 100 AA for negative strand (reverse-comp) of same nt interval
-> orf100rc = (-(tdb[seqID]))[-i - 300: -i]


It seems simpler, to me, to do:

  seq = db[seqID][i:]
  orf100 = seq.translation()[:100]

Here 'translation()' can take any of the six frames: +1, +2, +3, -1, -2
or -3.
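
To make the frame numbering concrete, here is a standalone sketch of what such a call could compute on a plain string. It is not the code on the simpleframe_ctb branch and does not use pygr. The function name translate, the standard genetic code table, and the 'X' fallback for unknown codons are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=1):
    """Translate a DNA string in frame +1, +2, +3, -1, -2 or -3.

    Positive frame f starts at offset f - 1 of the sequence; negative
    frame -f starts at offset f - 1 of its reverse complement.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    dna = dna.upper()
    if frame < 0:
        # Reverse complement, then treat it like a positive frame.
        dna = dna.translate(COMPLEMENT)[::-1]
    offset = abs(frame) - 1
    # Only complete codons are translated; a trailing partial codon is dropped.
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA", 1))
print(translate("ATGGCCTAA", -1))
```

With this numbering there is no frame 0, so the sign alone tells you the strand.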

Check out the simpleframe_ctb branch on my github repo, which implements
the 'translation' method.  (I also added in a bunch of tests for it in
sequence_test.py.)

The original motivation was to support code like this:

  blastx_matches = blastx_map[dna]
  dna_slice = dna[x:y]

  frame1 = dna_slice.translation(1)
  frame1_matches = blastx_matches[frame1]

  frame2 = dna_slice.translation(2)
  frame2_matches = blastx_matches[frame2]

The problem that I ran into with this code is that slices of sequences
(dna_slice as opposed to dna) need to have their frames adjusted, and I
appear to be incapable of basic math this month.
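
For reference, the adjustment can be sketched as plain arithmetic, assuming frames are numbered +1..+3 and -1..-3, with negative frames counted from the 3' end of whatever sequence is being translated. The helper name parent_frame and its arguments are hypothetical and not part of pygr.

```python
def parent_frame(slice_frame, start, stop, parent_len):
    """Map a frame of parent[start:stop] to the equivalent frame of parent.

    slice_frame is +1..+3 or -1..-3, relative to the slice itself.
    The result is the frame of the full parent sequence whose codons
    line up with the codons of that slice frame.
    """
    if slice_frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    if not 0 <= start <= stop <= parent_len:
        raise ValueError("slice must lie within the parent sequence")
    offset = abs(slice_frame) - 1
    if slice_frame > 0:
        # The first codon starts 'start + offset' bases into the parent.
        return (start + offset) % 3 + 1
    # On the reverse strand the slice begins parent_len - stop bases
    # into the parent's reverse complement.
    return -((parent_len - stop + offset) % 3 + 1)


# Frame +1 of dna[1:] is frame +2 of dna.
print(parent_frame(1, 1, 10, 10))
# Frame -1 of dna[:9] (parent length 10) is frame -2 of dna.
print(parent_frame(-1, 0, 9, 10))
```

The asymmetry is the part that is easy to get wrong: positive frames depend only on where the slice starts, while negative frames depend on where it stops and on the parent length.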

-> Alternatively, you can directly request one of the six frame  
-> translations of the entire sequence, using the TranslationDB's annodb  
-> attribute (which is just an annotation DB yielding TranslationAnnot  
-> annotations for the six frames).  I had to choose a naming convention  
-> for the six frame annotations, so I just appended to the sequence ID a  
-> colon and digit indicating what nucleotide the frame begins at (0, 1,  
-> or 2 for positive strand, -0, -1, or -2 for the negative strand).
-> 
-> frame0 = tdb.annodb[seqID + ':0'] # + strand
-> frame1 = tdb.annodb[seqID + ':1']
-> frame2 = tdb.annodb[seqID + ':2']
-> 
-> frame0rc = tdb.annodb[seqID + ':-0'] # - strand
-> frame1rc = tdb.annodb[seqID + ':-1']
-> frame2rc = tdb.annodb[seqID + ':-2']

OK, interesting... I worry that the ':' notation is arbitrary, and
the '-0' notation really rubs me the wrong way for hopefully obvious
mathematical reasons ;).  What's wrong with +1, +2, +3, -1, -2, -3?

-> The negative frames translate the negative strand interval (reverse  
-> complement) of the corresponding positive frame interval.  i.e.  
-> comparing their underlying nucleotide interval objects
-> 
-> frame0rc.sequence == -(frame0.sequence)
-> is True

I'm so confused about frame calculations that I can't think straight
about this any more -- does that match the BLAST frame calculations?
Perhaps I was too mired in the BLAST code thought process to separate
concerns, but I feel like we should adhere to NCBI's standards.  (My
tests of 'seq.translation()' do so.)
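
One way to check this is to compute the NCBI-style frame of the nucleotide interval that actually gets translated. The sketch below does only that arithmetic. The name ncbi_frame is hypothetical, and the assumption that the ':0' annotation covers the interval [0, 3*(L//3)) of a sequence of length L is mine, not something confirmed in this thread.

```python
def ncbi_frame(start, end, length, strand):
    """NCBI-style frame for translating the interval [start, end).

    strand > 0: the interval is translated as-is, so the frame depends
    on how far 'start' is from the 5' end of the sequence.
    strand < 0: the reverse complement of the interval is translated,
    so the frame depends on how far 'end' is from the 3' end.
    """
    if not 0 <= start <= end <= length:
        raise ValueError("interval must lie within the sequence")
    if strand > 0:
        return start % 3 + 1
    return -((length - end) % 3 + 1)


# Assumed interval for the ':0' / ':-0' annotations: whole codons from 0.
for length in (9, 10, 11):
    end = 3 * (length // 3)
    print(length, ncbi_frame(0, end, length, +1), ncbi_frame(0, end, length, -1))
```

Under that assumption, ':-0' lines up with frame -1 only when the sequence length is a multiple of 3; otherwise it corresponds to -2 or -3.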

tnx,
--titus
-- 
C. Titus Brown, [email protected]
