Re: [Genome] Regarding extracting first codon start (translation start) and codon stop (translation stop) for genes

Mary Goldman Tue, 19 Oct 2010 14:23:31 -0700

Hi Rahil,

I think it might be useful for me to do a demonstration using a small 
protein coding gene on the negative strand, uc009vjr.1. Here is the row 
in the knownGene table for uc009vjr.1:
name         uc009vjr.1
chrom        chr1
strand       -
txStart      893650
txEnd        894679
cdsStart     894010
cdsEnd       894620
exonCount    2
exonStarts   893650,894594,
exonEnds     894461,894679,
proteinID
alignID      uc009vjr.1


1. Extract the coding region on the positive strand. First match up the 
exonStarts and exonEnds to get each exon (in this case there are 2 
exons: chr1 893650 894461 and chr1 894594 894679). Next, trim the parts 
of the exons that extend past the cdsStart and cdsEnd (i.e. the UTRs). 
For this example, this results in chr1 894010 894461 and chr1 894594 
894620 (note that the cdsStart may not occur in the first exon, so 
some exons may need to be discarded altogether). Here is my result:

TTATCTCTCTTCTACCGAACTGCAGGCGGTGATTTCACCCAAGAACGTGA
GAGTTCTCCTAGATCGGGAAGAGATTTTTGCACAACTCACCAACATACGC
TCCCTGCCTAGGACAGAGTTTGGCACGGAACAGGAGCTCAGTAAACATCG
GATGAAAGAGTAAGTTAAGCTGAAAGGACTGGGGGGCAGAGGTCGGCGAT
CCTTAGGCCTTGGCCCTGAGACCCCAGGCGAGGTCAGCAACCCAACCGGG
GTGGGACAGGACGAGCAAGAGGTTCTGCTCACGCATGTCCCCACTAACCT
GGCCGAGGGGCTCCCGCCCGGCTTATCCGGACTCCGGGCAGCCTCGCGTG
CTTCCCGTGTCTCCGCTTGTGGAGAATTTTCGGACTCGGATTCGGACTCG
GAGTCAAAGCCCGAAGCTAGGAACTCGTCCACCGTCAGCTCCGCCAGGCG
C

CTCTTGCGGCTCCCCGCAGCTGCCAT
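
For anyone scripting step 1, here is a minimal Python sketch of the trimming. It assumes the knownGene fields have already been read from the table; the helper names parse_coords and cds_blocks are made up for illustration and are not part of any UCSC tool. Because the table uses 0-based starts and 1-based ends, the clipped intervals can be used directly as Python slices on the chromosome sequence.

```python
def parse_coords(field):
    """Turn a knownGene list such as '893650,894594,' into [893650, 894594]."""
    return [int(x) for x in field.split(",") if x]


def cds_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to [cds_start, cds_end) and drop exons that are pure UTR.

    Coordinates are 0-based start, 1-based end (half-open), always on the
    positive strand, regardless of the gene's strand.
    """
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        clipped_start = max(start, cds_start)
        clipped_end = min(end, cds_end)
        if clipped_start < clipped_end:  # exon overlaps the coding region
            blocks.append((clipped_start, clipped_end))
    return blocks


# The uc009vjr.1 row from this email.
blocks = cds_blocks(
    parse_coords("893650,894594,"),
    parse_coords("894461,894679,"),
    894010,
    894620,
)
print(blocks)                           # [(894010, 894461), (894594, 894620)]
print(sum(e - s for s, e in blocks))    # 477 bases, i.e. 159 codons
```

For a non-coding gene, where cdsStart equals cdsEnd, the same function returns an empty list.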

2. Put the 2 exons above together and reverse complement:

ATGGCAGCTGCGGGGAGCCGCAAGAGGCGCCTGGCGGAGCTGACGGTGGA
CGAGTTCCTAGCTTCGGGCTTTGACTCCGAGTCCGAATCCGAGTCCGAAAATT
CTCCACAAGCGGAGACACGGGAAGCACGCGAGGCTGCCCGGAGTCCGGATA
AGCCGGGCGGGAGCCCCTCGGCCAGGTTAGTGGGGACATGCGTGAGCAGAA
CCTCTTGCTCGTCCTGTCCCACCCCGGTTGGGTTGCTGACCTCGCCTGGGGTCT
CAGGGCCAAGGCCTAAGGATCGCCGACCTCTGCCCCCCAGTCCTTTCAGCTT
AACTTACTCTTTCATCCGATGTTTACTGAGCTCCTGTTCCGTGCCAAACTCTGTC
CTAGGCAGGGAGCGTATGTTGGTGAGTTGTGCAAAAATCTCTTCCCGATCTAGG
AGAACTCTCACGTTCTTGGGTGAAATCACCGCCTGCAGTTCGGTAGAAGAGAG
ATAA
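
Step 2 can be sketched in Python as follows. This is only an illustration: reverse_complement and spliced_cds are hypothetical helper names, and the sketch assumes the trimmed exon sequences were cut from the positive strand in ascending coordinate order, as in step 1. The important point is the order of operations: join first, then reverse complement the whole thing, and only for genes on the negative strand.

```python
# Complement map that also keeps N (and soft-masked lowercase) intact.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_cds(block_seqs, strand):
    """Join positive-strand CDS pieces; flip the result for '-' strand genes."""
    joined = "".join(block_seqs)
    return reverse_complement(joined) if strand == "-" else joined


# The 26-base piece from the second exon of uc009vjr.1 (shown in step 1).
exon2_cds = "CTCTTGCGGCTCCCCGCAGCTGCCAT"
print(reverse_complement(exon2_cds))  # ATGGCAGCTGCGGGGAGCCGCAAGAG
```

The output is the first 26 bases of the sequence in step 2, which is why the highest-coordinate exon supplies the ATG for a negative strand gene.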

3. Translate

MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAR
LVGTCVSRTSCSSCPTPVGLLTSPGVSGPRPKDRRPLPPSPFSLTYSFIRCLLSSCSVPN
SVLGRERMLVSCAKISSRSRRTLTFLGEITACSSVEER*

4. Compare to the protein sequence from the Table Browser and see that 
it is the same:

MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAR
LVGTCVSRTSCSSCPTPVGLLTSPGVSGPRPKDRRPLPPSPFSLTYSFIRCLLSSCSVPN
SVLGRERMLVSCAKISSRSRRTLTFLGEITACSSVEER
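
Steps 3 and 4 can be sketched the same way. This assumes the standard genetic code, and it writes X for any codon containing an N, which is a convention chosen here for illustration. The names translate and same_protein are hypothetical helpers, not UCSC functions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate from the first base in frame 0; '*' marks a stop codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


def same_protein(translated, reference):
    """Compare two proteins, ignoring a trailing stop symbol on either."""
    return translated.rstrip("*") == reference.rstrip("*")


# First 8 codons of the reverse-complemented uc009vjr.1 coding sequence.
print(translate("ATGGCAGCTGCGGGGAGCCGCAAG"))  # MAAAGSRK
```

The comparison strips the trailing * because the translation in step 3 includes the stop codon while the Table Browser protein does not.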

I hope this information is helpful. Please feel free to contact the mail 
list again if you require further assistance.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group

On 10/19/10 9:03 AM, [email protected] wrote:
> Hi Mary,
>
> Thanks for all the help and I apologize to bother you by sending similar
> questions because most of the errors I have made is because I did not have
> the thorough understanding of how the data is represented in UCSC Genome
> Browser Tables. I am still not clear how do I interpret the coordinate
> values of different features of genes such as CDS start, CDS end, exon
> starts and exon ends if genes are on the negative strand. I did what you
> told me in the previous email and still the translation results for those
> genes are not matching with the protein sequences present in your UCSC
> database.
>
> I also checked the UCSC mailing list and found a reply for converting
> negative strand coordinates to positive strand by subtracting the
> coordinates mentioned for genes on negative strand by the length of the
> chromosome. For my results I did not do that. Instead, I just reverse
> complemented the sequence and applied the coordinate values mentioned in
> the Table for negative strand genes for getting the translated sequence
> assuming those coordinate values are with respect to the reverse
> complemented sequence. For example, when I looked for gene uc001ako.2 in
> UCSC Table I found that it is present on a negative strand and the CDS
> start and CDS end values mentioned are 3547538 and 3566563, respectively.
> I did the translation using the following approaches:
>
> Approach 1:
> 1. I reverse complemented the chromosome 1 sequence (obtained from whole
> genome download of UCSC Genome Browser for hg19)
> 2. I extracted the exon regions from the above reverse complemented
> sequence for uc001ako.2 gene starting from CDS start till CDS end using
> the exon starts and exon ends from UCSC Table (considering CDS starts and
> exon starts as 0 based and their ends as 1 based)
> 3. Then I joined those exon regions and translated.
>
> Approach 2:
> 1. I extracted the exon regions from original chromosome 1 sequence
> (without reverse complementing chromosome sequence) from tx start till tx
> end mentioned for that gene.
> 2. I joined those regions and reverse complemented the resultant sequence
> to get reverse complemented mature mRNA
> 3. I then calculated CDS start and end with respect to above reverse
> complemented resultant sequence ( for example: CDS start - tx start + 1)
> and then translated the reverse complemented sequence from the relative
> CDS start thus obtained (i.e. I translated the reverse complemented mature
> mRNA from CDS start site with respect to mature mRNA).
>
> Approach 3:
> 1. I extracted a region beginning at tx start till tx end from original
> chromosome 1 sequence (without reverse complementing chromosome sequence)
> for that gene to get a pre-mRNA sequence (tx start and tx end values used
> here were 3547331 (+ 1) and 3566671, respectively)
> 2. Then I reverse complemented the above region (i.e. pre-mRNA sequence)
> 3. Then I extracted the exon regions from the above reverse complemented
> pre-mRNA region by first calculating the exon starts and ends with respect
> to the above reverse complemented region (for example exon start - tx
> start + 1 using the exon start values directly from Table with 0 based
> start adjustment). The exon regions were extracted beginning from CDS
> start with respect to above region.
> 4. I then translated the thus extracted exon regions.
>
> Following are the Translation results through different approaches as well
> as protein sequence from UCSC Genome Browser:
>
> >uc001ako.2 Approach 1: reverse complementing chr 1 (partial result shown)
> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXHDHLVSLSVMLLINSE
> LKNCHCTNIANGVTISKLNCSHL*
>
> >uc001ako.2 Approach 2: reverse complementing mature mRNA (partial sequence shown)
> YSASSPRTASTWLPVSSTG*WSGM*TPFRSFSCTRA*TRSSTSSGRQTRSSSCAPCTSEG
> WCRSGL*SSPNGTAK*T
>
> >uc001ako.2 Approach 3: reverse complementing pre-mRNA (partial sequence shown)
> LLCKFSPDGKYLVSGGDLGHGQDILGASVXXXXXXXXXXGWRQLGDPAGAVPGQVRGPGR
> YPGQVQCLSTGKWCGYF
>
> >uc001ako.2 from UCSC Genome Browser Table sequence
> MNFSEVFKLSSLLCKFSPDGKYLASCVQYRLVVRDVNTLQILQLYTCLDQIQHIEWSADS
> LFILCAMYKRGLVQVWSLEQPEWHCKIDEGSAGLVASCWSPDGRHILNTTEFHLRITVWS
> LCTKSVSYIKYPKACLQGITFTRDGRYMALAERRDCKDYVSIFVCSDWQLLRHFDTDTQD
> LTGIEWAPNGCVLAVWDTCLEYKILLYSLDGRLLSTYSAYEWSLGIKSVAWSPSSQFLAV
> GSYDGKVRILNHVTWKMITEFGHPAAINDPKIVVYKEAEKSPQLGLGCLSFPPPRAGAGP
> LPSSESKYEIASVPVSLQTLKPVTDRANPKIGIGMLAFSPDSYFLATRNDNIPNAVWVWD
> IQKLRLFAVLEQLSPVRAFQWDPQQPRLAICTGGSRLYLWSPAGCMSVQVPGEGDFAVLS
> LCWHLSGDSMALLSKDHFCLCFLETEAVVGTACRQLGGHT
>
> Please let me know where I am making a mistake.
>
> Note: X marks positions where the DNA sequence has N's, and * represents a
> stop codon.
>
> Thank you very much,
> Rahil Sethi
>
>    
>> Hi Jim,
>>
>> The three examples you give are genes that are on the negative strand.
>> After you obtain the mRNA sequence for genes on the negative strand, you
>> will first need to reverse complement the sequence before starting
>> translation. Please note that, as I mentioned in my last email, the
>> value in the "start" field is always the lowest coordinate and the value
>> in the "end" field is always the highest coordinate, regardless of
>> strand. Thus you will still extract the mRNA sequence the same way for
>> genes on both the negative and positive strand. It is only *after*
>> extracting the mRNA sequence that you need to treat genes on the
>> negative strand differently by reverse complementing their sequence.
>>
>> I'm afraid I don't quite understand what you mean by "if codon at CDS
>> start is ATG (i.e. corresponding to M) the translation results for those
>> genes match\100% align with their protein sequences in UCSC database".
>> Could you please add clarification if this is different from the issue
>> above? Thank you.
>>
>> Best,
>> Mary
>> ---------------------
>> Mary Goldman
>> UCSC Bioinformatics Group
>>
>> On 10/14/10 2:10 PM, [email protected] wrote:
>>      
>>> Hi Pauline and Mary,
>>>
>>> Thank you very much for the information regarding the start positions
>>> being 0 based whilst end positions being 1 based. I made that adjustment
>>> and the translation for that gene (uc010nya.1) came out to be exactly the
>>> same as the protein sequence as mentioned in UCSC database for that gene.
>>> My question is when I did the translation in the same manner (assuming CDS
>>> start as codon start\translation start, CDS end codon end with 0 based
>>> starts and 1 based end) the results for some genes/transcripts did not
>>> match with their protein sequences present in UCSC database for example
>>> uc001adj.1, uc001ail.2, uc009vjq.2 etc.
>>>
>>> Again, I used the chr1.fa.masked downloaded from UCSC Genome Browser ftp
>>> Downloads for hg19 for parsing the exon boundaries. Then joined the exon
>>> sequences and began the translation reading frame from CDS start till CDS
>>> end.
>>>
>>> Below is an example of a part of translation sequence for uc001ail.2
>>> (beginning from CDS start) and its protein sequence obtained from UCSC
>>> Genome Browser.
>>>
>>>
>>> >uc001ail.2 partially translated sequence beginning from CDS start
>>> LGASQHLGYKDDLVGLLHIPP*LQQGRHHEWVVWIKVRRWHPGDADGLRLAALHGAPGGLNR
>>>
>>>
>>> >uc001ail.2 protein sequence from UCSC database
>>> MGTGVASMITCSIEGSVLNMGYVIAGESVSSGFKLQNNSLLPIKFSMHLDSLSSTRGRGQ
>>> QQLPQFLSSPSQRTEVVGTQNLNGQSVFSVAPVKGVMDPGKTQDFTVTFSPDHESLYFSD
>>> KLQVVLFEKKISHQILLKGAACQHMMFVEGGDPLDVPVESLTAIPVFDPRHREASSRPGP
>>> LSPEAEELRPILVTLDYIQFDTDTPAPPATRELQVGCIRTTQPSPKKTVEFSIDSVASLQ
>>> HKGFSIEPSRGSVERGQTKTISISWVPPADFDPDHPLMVSALLQLRGDVKETYKVIFVAQ
>>> VLTGP
>>>
>>> Further, for the genes\transcripts that I checked the translation results,
>>> if codon at CDS start is ATG (i.e. corresponding to M) the translation
>>> results for those genes match\100% align with their protein sequences in
>>> UCSC database but for the other genes their translation result does not
>>> match with UCSC database.
>>>
>>> It will be helpful if you could let me know why for those
>>> genes\transcripts the translation results are different with their protein
>>> sequences in UCSC Database.
>>>
>>> Thank you,
>>> Rahil Sethi
>>>
>>>
>>>        
>>>> Hello jlw,
>>>>
>>>> Looking at the two protein sequences sent in the previous question in
>>>> this thread they seem to diverge right before the end of the first exon
>>>> so I wonder if your program isn't parsing the exon/intron boundaries
>>>> correctly?
>>>>
>>>> Another issue which may affect coordinate calculation - does your
>>>> software take into account UCSC's 0-based start and 1-based end
>>>> coordinate system? Please see this FAQ for more information:
>>>>
>>>> http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
>>>>
>>>> Hopefully this information was helpful and answers your question. If you
>>>> have further questions or require clarification feel free to contact the
>>>> mailing list at [email protected].
>>>>
>>>> Best regards,
>>>>
>>>> Pauline Fujita
>>>>
>>>> UCSC Genome Bioinformatics Group
>>>> http://genome.ucsc.edu
>>>>
>>>>
>>>>
>>>>
>>>> On 10/13/10 5:09 AM, James Lyons-Weiler wrote:
>>>>
>>>>          
>>>>> mary...
>>>>>
>>>>> this use of cds start issue has been very confusing to us here.  maybe you
>>>>> can help with additional details.
>>>>>
>>>>> what does it mean to 'use the cds start as the start codon'... in terms of
>>>>> algorithms, please? do you mean a literal translation from that codon,
>>>>> whether the 1st triplet of the cds is atg or not?
>>>>>
>>>>> what are the consequences of using the cds start as the start codon when
>>>>> the transcription start codon is known and annotated and should be used,
>>>>> instead?
>>>>>
>>>>> mary, fyi we are using our own translator, not any ucsc software.  is the
>>>>> ucsc software programmed to anticipate the cds start as the 'start' codon
>>>>> but still return the translation of the annotated transcript or something?
>>>>>
>>>>> jlw
>>>>> director
>>>>> bioinformatics analysis core
>>>>> pitt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> Hi Mary,
>>>>>>
>>>>>> Thanks for the answer but I would like to know why the translation result
>>>>>> for many genes come out to be different when CDS start mentioned in UCSC
>>>>>> Genome Browser Table is considered to be the translation start?
>>>>>> For example I downloaded a complete chromosome 1 of hg19 from UCSC Genome
>>>>>> Browser ftp downloads whole genome and then I carefully extracted all
>>>>>> the exonic regions starting from base at CDS start till base at CDS end
>>>>>> for a gene/transcript (uc010nya.1), the exon start and end positions and
>>>>>> CDS start and end obtained from UCSC. Then I translated those regions
>>>>>> assuming that reading frame begins from CDS start (translation start) and
>>>>>> the string of amino acids differ from the protein sequence of uc010nya.1
>>>>>> obtained from UCSC Genome Browser.
>>>>>>
>>>>>> The two sequences are:
>>>>>>
>>>>>>
>>>>>> >Manual translation from CDS start till CDS end for uc010nya.1
>>>>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQDSAPSRQSPAA
>>>>>> EWRPRRLLQVSTESARPLQLPGYPGLCRCALLQRPAQGRPQVRAAALRGR
>>>>>> GQDRGVYAAAPETGNSWRAQPSXXXXXXXXLCL*LPTPFCS*HSPAHNP*
>>>>>> CLLCVPETFLDLGPPGASSVAPDSARPLPV*TLSPHLLTX
>>>>>>
>>>>>>
>>>>>> >uc010nya.1 obtained from UCSC Genome Browser table
>>>>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQTLLPAASLLQLNGVRDACCKF
>>>>>> LLSQLDPSNCLGIRGFADAHSCSDLLKAAHRYVLQHFVDVAKTEEFMLLPLKQVTAGGPS
>>>>>> PRPPPHPTPVFVFDSRPRFVPDTALPTILSACCVSPRPFWIWAPQEPRLWLLTLLGPSQY
>>>>>> EHSAPTC
>>>>>>
>>>>>> From the first line of the first sequence, after "GNVQ", you will start
>>>>>> seeing the deviation from the second sequence.
>>>>>>
>>>>>> Please let me know why does it then differ.
>>>>>>
>>>>>> Thank you,
>>>>>> Rahil Sethi
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> Hi Rahil,
>>>>>>>
>>>>>>> Thank you so much for giving the assembly, track and table you were
>>>>>>> using when you encountered your question - it is much appreciated!
>>>>>>>
>>>>>>> UCSC Genes does not have cdsStartStat, cdsEndStat or exonFrames fields
>>>>>>> like most of our gene prediction tracks (more information about why can
>>>>>>> be found in this previous mailing list question:
>>>>>>> https://lists.soe.ucsc.edu/pipermail/genome/2010-September/023585.html).
>>>>>>> This means that you can use the CDS start and CDS end as start and stop
>>>>>>> codons. Please keep in mind that we have made the CDS start equal the
>>>>>>> CDS end for non-coding genes.
>>>>>>>
>>>>>>> I hope this information is helpful.  Please feel free to contact the
>>>>>>> mail list again if you require further assistance.
>>>>>>>
>>>>>>> Best,
>>>>>>> Mary
>>>>>>> ------------------
>>>>>>> Mary Goldman
>>>>>>> UCSC Bioinformatics Group
>>>>>>>
>>>>>>> On 10/11/10 7:29 AM, [email protected] wrote:
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am trying to extract the codon start and codon stop for a set of genes
>>>>>>>> in a given position, from Tables in UCSC Genome Browser. Whenever I click
>>>>>>>> output for Genes and Gene Predictions in a chromosome position range, it
>>>>>>>> gives me all the features of genes like exon start, exon stop, CDS start,
>>>>>>>> CDS stop, but does not give me the codon start (start position of the
>>>>>>>> first codon i.e. translation start) and codon stop (position of stop codon
>>>>>>>> i.e. translation stop).
>>>>>>>>
>>>>>>>> Please let me know how can I get this information?
>>>>>>>>
>>>>>>>> I am using:
>>>>>>>> Genome: Hg19
>>>>>>>> Group: Genes and Gene Prediction Tracks
>>>>>>>> Track: UCSC Genes
>>>>>>>> Table: KnownGene
>>>>>>>> region: defined regions
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Rahil Sethi
>>>>>>>> _______________________________________________
>>>>>>>> Genome maillist  -  [email protected]
>>>>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>              
>>>>> --
>>>>> Thank you very much,
>>>>>
>>>>> James Lyons-Weiler
>>>>>
>>>>>
>>>>> Director, Bioinformatics Analysis Core
>>>>> Genomics and Proteomics Core Laboratories
>>>>> Department of Biomedical Informatics
>>>>> University of Pittsburgh Cancer Institute
>>>>> 3rd Floor
>>>>> 3343 Forbes Avenue
>>>>> Pittsburgh, PA 15260
>>>>> phone: 412-728-8743
>>>>> reply-to: [email protected]
>>>>> _______________________________________________
>>>>> Genome maillist  -  [email protected]
>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
