On Thu, Jul 16, 2009 at 11:50:02AM -0700, Christopher Lee wrote:
-> On Jul 15, 2009, at 10:37 PM, C. Titus Brown wrote:
-> > One of the things I wanted to do with my code was to provide a
-> > 'translation' on sequences that would let you translate a DNA sequence
-> > into a protein sequence in any specified frame. What do you think?
->
-> If I'm understanding your question right, TranslationDB does this
-> pretty directly. Say you have a nucleotide sequence database object
-> db. Then you can get the translation of any desired frame or subslice
-> of a sequence in db as easily as:
->
-> tdb = TranslationDB(db)
-> # 100 AA beginning at nucleotide i on positive strand
-> orf100 = tdb[seqID][i:i+300]
-> # 100 AA for negative strand (reverse-comp) of same nt interval
-> orf100rc = (-(tdb[seqID]))[-i - 300: -i]

It seems simpler, to me, to do:

   seq = db[seqID][i:]
   orf100 = seq.translation()[:100]

Here 'translation()' can take any frame, e.g. +1, -1, -3, etc.

Check out the simpleframe_ctb branch on my github repo, which implements
the 'translation' method. (I also added in a bunch of tests for it in
sequence_test.py.)

The original motivation was to support code like this,

   blastx_matches = blastx_map[dna]

   dna_slice = dna[x:y]
   frame1 = dna_slice.translation(1)
   frame1_matches = blastx_matches[frame1]

   frame2 = dna_slice.translation(2)
   frame2_matches = blastx_matches[frame2]

The problem that I ran into with this code is that slices of sequences
(dna_slice as opposed to dna) need to have their frames adjusted, and I
appear to be incapable of basic math this month.

-> Alternatively, you can directly request one of the six frame
-> translations of the entire sequence, using the TranslationDB's annodb
-> attribute (which is just an annotation DB yielding TranslationAnnot
-> annotations for the six frames). I had to choose a naming convention
-> for the six frame annotations, so I just appended to the sequence ID a
-> colon and digit indicating what nucleotide the frame begins at (0, 1,
-> or 2 for positive strand, -0, -1, or -2 for the negative strand).
->
-> frame0 = tdb.annodb[seqID + ':0'] # + strand
-> frame1 = tdb.annodb[seqID + ':1']
-> frame2 = tdb.annodb[seqID + ':2']
->
-> frame0rc = tdb.annodb[seqID + ':-0'] # - strand
-> frame1rc = tdb.annodb[seqID + ':-1']
-> frame2rc = tdb.annodb[seqID + ':-2']

OK, interesting... I worry that the ':' notation is arbitrary, and the
'-0' notation really rubs me the wrong way for hopefully obvious
mathematical reasons ;). What's wrong with +1, +2, +3, -1, -2, -3?

-> The negative frames translate the negative strand interval (reverse
-> complement) of the corresponding positive frame interval. i.e.
-> comparing their underlying nucleotide interval objects
->
-> frame0rc.sequence == -(frame0.sequence)
-> is True

I'm so confused about frame calculations that I can't think straight
about this any more -- does that match the BLAST frame calculations?
Perhaps I was too mired in the BLAST code thought process to separate
concerns, but I feel like we should adhere to NCBI's standards. (My
tests of 'seq.translation()' do so.)

tnx,
--titus
-- 
C. Titus Brown, [email protected]

You received this message because you are subscribed to the Google Groups "pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en
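
Editorial sketch of the frame bookkeeping discussed in the message above,
in plain Python rather than pygr. Nothing here is pygr's API or the code
on the simpleframe_ctb branch: translate(), reverse_complement() and
slice_frame() are hypothetical helpers written for illustration. It
assumes the +1..+3 / -1..-3 naming, with frame -1 read from the first base
of the reverse complement; whether that matches NCBI BLAST's frame
numbering exactly should be checked against NCBI's documentation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=1):
    """Translate dna in frame +1..+3 or -1..-3 (hypothetical helper).

    Negative frames are read from the reverse complement (an assumption,
    see the note above). Unknown codons become 'X'; a trailing partial
    codon is dropped.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    strand = dna if frame > 0 else reverse_complement(dna)
    offset = abs(frame) - 1
    codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def slice_frame(parent_frame, start, stop, parent_len):
    """Frame of dna[start:stop] that stays in phase with parent_frame of dna."""
    offset = abs(parent_frame) - 1
    # Distance from the start of the strand being read to the slice:
    # counted from the left on the + strand, from the right on the - strand.
    shift = start if parent_frame > 0 else parent_len - stop
    new_offset = (offset - shift) % 3
    sign = 1 if parent_frame > 0 else -1
    return sign * (new_offset + 1)


dna = "ATGGCCAAATGA"
full = translate(dna, 1)                      # frame +1 of the whole sequence
sub_frame = slice_frame(1, 4, 10, len(dna))   # frame to use on dna[4:10]
sub = translate(dna[4:10], sub_frame)         # in-phase piece of `full`
```

The adjustment is just modular arithmetic: the slice has to skip however
many bases are needed to land back on a codon boundary of the parent frame,
measured from whichever end the strand is read from. That also suggests why
a '-0' style label carries no extra information over -1: the sign names the
strand and the magnitude names the offset plus one.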
