Re: [Genome] Regarding extracting first codon start (translation start) and codon stop (translation stop) for genes

Mary Goldman Thu, 14 Oct 2010 14:02:22 -0700

Hi Jim,

The transcription start is where the mRNA begins (i.e. the beginning of 
the first exon). The translation start is where the coding sequence 
starts (i.e. the beginning of the protein). As you know, there is often 
sequence between the transcription start and translation start that is 
called the untranslated region (UTR). Note that UTRs are considered 
part of exons, so the CDS start and CDS end can fall mid-exon.
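
As a rough illustration of the point that the CDS can begin and end mid-exon, the Python sketch below clips each exon to the CDS interval. The function name cds_blocks and the example coordinates are made up for illustration. It assumes the 0-based, half-open start/end convention of the UCSC tables, and that the CDS start equals the CDS end for non-coding genes.

```python
def cds_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the CDS and return the coding pieces in genomic order.

    Coordinates are assumed to be 0-based, half-open (start inclusive,
    end exclusive). Exons that lie entirely in a UTR are dropped.
    """
    if cds_start >= cds_end:
        # Assumed convention: non-coding genes have cdsStart == cdsEnd.
        return []
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        lo = max(start, cds_start)  # trim UTR on the low-coordinate side
        hi = min(end, cds_end)      # trim UTR on the high-coordinate side
        if lo < hi:
            blocks.append((lo, hi))
    return blocks


# Hypothetical transcript: three exons, CDS starting and ending mid-exon.
print(cds_blocks([100, 300, 500], [200, 400, 600], 150, 550))
```

The first and last blocks come back shorter than their exons, because the UTR portion has been trimmed away.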


If the translation start is annotated on the mRNA and that part of the 
mRNA aligns cleanly to the genome, then we put that coordinate in the 
CDS start field. If the annotated CDS region doesn't include a proper 
start codon, or if the start codon doesn't align cleanly to the 
reference genome, then our CDS start is the first complete codon that 
can be aligned to the genome and has the most evidence from other 
sources (e.g. protein evidence). Thus, the CDS start is the 
beginning of translation for UCSC Genes.
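
Because the CDS start may not always coincide with a proper start codon, a manual validation step might want to flag such transcripts rather than assume an ATG. The sketch below is one possible sanity check on an already spliced, correctly oriented coding sequence. The function name cds_checks and the choice of checks are assumptions for illustration, not part of any UCSC tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def cds_checks(cds):
    """Report simple properties of a spliced coding sequence (5'->3')."""
    cds = cds.upper()
    return {
        "starts_with_atg": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(cds) % 3 == 0,
    }


# Hypothetical sequences, for illustration only.
print(cds_checks("ATGAAATGGTAA"))
print(cds_checks("CTGAAATGG"))
```

A transcript that fails one of these checks is not necessarily wrong, but it is worth a closer look before its translation is compared with a reference protein.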

The last thing I would like to mention is that, regardless of strand, 
the value in the "start" field is always the lowest coordinate and the 
value in the "end" field is always the highest coordinate. For genes on 
the negative strand, this means translation begins at the "end" 
coordinate, and the sequence must be reverse-complemented before 
codons are translated.
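
To make the strand issue concrete, here is a minimal Python sketch of a strand-aware translation. It takes a chromosome sequence and a list of (start, end) coding blocks in genomic order, joins them, reverse-complements the result for negative-strand genes, and translates with the standard genetic code. The function names, the toy sequence and the 0-based, half-open coordinates are assumptions for illustration. This is not how UCSC produces its protein sequences.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_cds(chrom_seq, blocks, strand):
    """Join coding blocks (genomic order) and orient them 5'->3'."""
    seq = "".join(chrom_seq[s:e] for s, e in blocks).upper()
    if strand == "-":
        # Lowest coordinate is the 3' end on the negative strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def translate(cds):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


# Toy plus-strand example: two coding blocks separated by a 2-base intron.
plus = "GG" + "ATGAAA" + "CC" + "TGGTAA" + "GG"
print(translate(spliced_cds(plus, [(2, 8), (10, 16)], "+")))  # MKW*
```

For a negative-strand gene the same blocks are passed in genomic (lowest-first) order with strand "-". Joining the blocks first and reverse-complementing once keeps the exon order and the reading frame consistent.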

I hope this information is helpful.  Please feel free to contact the 
mail list again if you require further assistance.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group

On 10/13/10 9:51 AM, James Lyons-Weiler wrote:
> Mary and Pauline,
>
> Thank you Pauline for your message re: the intron/exon boundaries and the
> 0/1 counting issue...
>
> however...
>
> I'm sorry... did I miss the answers to my questions?
>
> Rahil is following the advice of Pauline to start translating mRNA from
> the CDS start, not the annotated transcription start site, and, for the
> life of me, I can't imagine why that is a good idea.  Am I missing
> something?
>
> If it isn't too much trouble, please provide answers to my questions
> below.  We have a project awaiting annotation and our manual validation
> step is stuck on this issue.
>
> Thanks,
> jim lyons-weiler
> director
> bioinformatics analysis core
> pitt
>
>
>    
>> mary...
>>
>> this use of cds start issue has been very confusing to us here.  maybe you
>> can help with additional details.
>>
>> what does it mean to 'use the cds start as the start codon'... in terms of
>> algorithms, please? do you mean a literal translation from that codon,
>> whether the 1st triplet of the cds is atg or not?
>>
>> what are the consequences of using the cds start as the start codon when
>> the transcription start codon is known and annotated and should be used,
>> instead?
>>
>> mary, fyi we are using our own translator, not any ucsc software.  is the
>> ucsc software programmed to anticipate the cds start as the 'start' codon
>> but still return the translation of the annotated transcript or something?
>>
>> jlw
>> director
>> bioinformatics analysis core
>> pitt
>>
>>
>>      
>>> Hi Mary,
>>>
>>> Thanks for the answer, but I would like to know why the translation
>>> result for many genes comes out different when the CDS start given in
>>> the UCSC Genome Browser table is taken as the translation start.
>>> For example, I downloaded the complete chromosome 1 of hg19 from the
>>> UCSC Genome Browser FTP downloads, then carefully extracted all the
>>> exonic regions from the base at CDS start to the base at CDS end for
>>> a gene/transcript (uc010nya.1), using the exon start and end positions
>>> and the CDS start and end obtained from UCSC. I then translated those
>>> regions, assuming that the reading frame begins at the CDS start
>>> (translation start), and the resulting string of amino acids differs
>>> from the protein sequence of uc010nya.1 obtained from the UCSC Genome
>>> Browser.
>>>
>>> The two sequences are:
>>>
>>>        
>>>> Manual translation from CDS start till CDS end for uc010nya.1
>>>>          
>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQDSAPSRQSPAA
>>> EWRPRRLLQVSTESARPLQLPGYPGLCRCALLQRPAQGRPQVRAAALRGR
>>> GQDRGVYAAAPETGNSWRAQPSXXXXXXXXLCL*LPTPFCS*HSPAHNP*
>>> CLLCVPETFLDLGPPGASSVAPDSARPLPV*TLSPHLLTX
>>>
>>>        
>>>> uc010nya.1 obtained from UCSC Genome Browser table
>>>>          
>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQTLLPAASLLQLNGVRDACCKF
>>> LLSQLDPSNCLGIRGFADAHSCSDLLKAAHRYVLQHFVDVAKTEEFMLLPLKQVTAGGPS
>>> PRPPPHPTPVFVFDSRPRFVPDTALPTILSACCVSPRPFWIWAPQEPRLWLLTLLGPSQY
>>> EHSAPTC
>>>
>>> On the first line of the first sequence, the deviation from the
>>> second sequence begins right after "GNVQ".
>>>
>>> Please let me know why the two sequences differ.
>>>
>>> Thank you,
>>> Rahil Sethi
>>>
>>>        
>>>> Hi Rahil,
>>>>
>>>> Thank you so much for giving the assembly, track and table you were
>>>> using when you encountered your question - it is much appreciated!
>>>>
>>>> UCSC Genes does not have cdsStartStat, cdsEndStat or exonFrames fields
>>>> like most of our gene prediction tracks (more information about why can
>>>> be found in this previous mailing list question:
>>>> https://lists.soe.ucsc.edu/pipermail/genome/2010-September/023585.html).
>>>> This means that you can use the CDS start and CDS end as start and stop
>>>> codons. Please keep in mind that we have made the CDS start equal the
>>>> CDS end for non-coding genes.
>>>>
>>>> I hope this information is helpful.  Please feel free to contact the
>>>> mail list again if you require further assistance.
>>>>
>>>> Best,
>>>> Mary
>>>> ------------------
>>>> Mary Goldman
>>>> UCSC Bioinformatics Group
>>>>
>>>> On 10/11/10 7:29 AM, [email protected] wrote:
>>>>          
>>>>> Hello,
>>>>>
>>>>> I am trying to extract the codon start and codon stop for a set of
>>>>> genes in a given position from the tables in the UCSC Genome Browser.
>>>>> Whenever I request output for Genes and Gene Predictions in a
>>>>> chromosome position range, it gives me all the features of the genes,
>>>>> such as exon start, exon stop, CDS start and CDS stop, but it does not
>>>>> give me the codon start (start position of the first codon, i.e.
>>>>> translation start) or the codon stop (position of the stop codon, i.e.
>>>>> translation stop).
>>>>>
>>>>> Please let me know how I can get this information.
>>>>>
>>>>> I am using:
>>>>> Genome: Hg19
>>>>> Group: Genes and Gene Prediction Tracks
>>>>> Track: UCSC Genes
>>>>> Table: KnownGene
>>>>> region: defined regions
>>>>>
>>>>> Thank you,
>>>>> Rahil Sethi
>>>>> _______________________________________________
>>>>> Genome maillist  -  [email protected]
>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>>>
>>>>>            
>>>>          
>>>
>>>        
>>
>> --
>> Thank you very much,
>>
>> James Lyons-Weiler
>>
>>
>> Director, Bioinformatics Analysis Core
>> Genomics and Proteomics Core Laboratories
>> Department of Biomedical Informatics
>> University of Pittsburgh Cancer Institute
>> 3rd Floor
>> 3343 Forbes Avenue
>> Pittsburgh, PA 15260
>> phone: 412-728-8743
>> reply-to: [email protected]
>>
>>      
>
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
