Re: [Genome] length of coding regions

Pauline Fujita Tue, 05 Oct 2010 16:08:59 -0700

Hello Rani,

Looking at the NCBI records for your gene (see the very end):


http://www.ncbi.nlm.nih.gov/nuccore/9257219?from=370&to=371&report=gbwithparts

I see 2 variants (NM_000804 and NM_000804.2) which seem to differ in the 
exon you found, which would explain the difference you're seeing. 
Generally speaking, the refGene table is composed of gene annotations 
produced by aligning the RefSeq mRNAs/RNAs to the genome. 
Genome annotations contain only genomic coordinates, so for annotations 
created from an alignment, insertions/deletions in the query sequence 
are lost. The pairwise alignments, however, preserve this information. 
So to really understand the base-level differences between the mRNA and 
the genome, you might want to look at the alignments, which are in the 
refSeqAli table. (Also note: gaps of <= 8 bases are closed in an attempt 
to create a valid gene model. The gap closing is not optimal.)
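
For anyone who wants to reproduce the length arithmetic, here is a minimal
Python sketch using the exon coordinates quoted in the original message
below. It relies on the UCSC table convention of 0-based starts and
exclusive ends, so a length is simply end minus start with no +1. The
cdsStart/cdsEnd values are an assumption: they do not appear in this
thread and were back-calculated from the reported UTR lengths (50 and 59)
for a plus-strand transcript. The function names are illustrative only and
are not part of any UCSC tool.

```python
# Exon coordinates copied from the refGene rows quoted in this thread
# (hg18, chr11). UCSC tables store starts 0-based and ends exclusive,
# so the length of an interval is end - start (no +1).
EXON_STARTS = [71524418, 71524640, 71527654, 71528038, 71528278]
EXON_ENDS = [71524462, 71524814, 71527841, 71528174, 71528582]

# ASSUMPTION: cdsStart/cdsEnd are not quoted in the thread. These values
# are back-calculated from the reported UTR lengths (5' = 50, 3' = 59),
# assuming a plus-strand transcript.
CDS_START = 71524646
CDS_END = 71528523


def exon_lengths(starts, ends):
    """Length of each exon under 0-based, half-open coordinates."""
    return [end - start for start, end in zip(starts, ends)]


def cds_length(starts, ends, cds_start, cds_end):
    """Sum the parts of each exon that fall inside [cds_start, cds_end)."""
    total = 0
    for start, end in zip(starts, ends):
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return total


def one_based_length(first, last):
    """Length of a 1-based, inclusive range as used in GenBank records."""
    return last - first + 1


lengths = exon_lengths(EXON_STARTS, EXON_ENDS)
genomic_cds = cds_length(EXON_STARTS, EXON_ENDS, CDS_START, CDS_END)
genbank_cds = one_based_length(51, 788)  # CDS 51..788 in the NCBI record

print("exon lengths:", lengths, "total:", sum(lengths))
print("CDS from genomic coordinates:", genomic_cds, "mod 3 =", genomic_cds % 3)
print("CDS from GenBank record:", genbank_cds, "mod 3 =", genbank_cds % 3)
```

A CDS length from the genomic coordinates that is not divisible by 3 flags
a transcript whose alignment to the genome contains an insertion or
deletion, which is the kind of case worth checking in refSeqAli.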

Best regards,

Pauline Fujita

UCSC Genome Bioinformatics Group
http://genome.ucsc.edu




On 9/29/10 3:19 AM, [email protected] wrote:
> Hello,
>
> I have downloaded the refGene table from the RefSeq Genes track (hg18) and  
> found the following problem: For hundreds of protein-coding  
> transcripts, the length of the coding region is not a whole  
> multiple of three.
>
> As one example, I checked transcript NM_000804. According to the NCBI  
> nucleotide DB record for this transcript, the coding length is 738  
> (which is fine: 738=246*3); but calculating the coding region length  
> from the coordinates provided in refGene, the length is 736.
>
> To understand where the difference comes from, I compared exons’  
> lengths and found that the problem is in exon3: there is a difference  
> of 2 nucleotides in that exon – see below.
>
> Tx=NM_000804, (chr11)
>
> NCBI nucleotide DB info
> http://www.ncbi.nlm.nih.gov/nuccore/9257219
> =============================================
>       exon1            1..44          Len=44
>       exon2            45..218        Len=174
>       exon3            219..407       Len=189
>       exon4            408..543       Len=136
>       exon5            544..847       Len=304
>
>       CDS             51..788 Len=738
>       polyA_site      847
>
>
> RefSeq table downloaded from UCSC
> =======================================
> exon1 len=44,  exS=71524418, exE=71524462
> exon2 len=174, exS=71524640, exE=71524814
> exon3 len=187, exS=71527654, exE=71527841 <----- (len is 187 instead of 189)
> exon4 len=136, exS=71528038, exE=71528174
> exon5 len=304, exS=71528278, exE=71528582
>
> 5utrL=50, cdsL=736, 3utrL=59, mRNA_L=845
> ------------------------------------------------------
>
> •     Could you please check why, for many protein-coding transcripts, the  
> length of the coding region is not a whole multiple of three?
>
> •     Another problem that I encountered when calculating exons’ lengths  
> was that in order to get the correct length (according to the NCBI  
> nucleotide DB), one has to calculate (exonEnd – exonS) rather than  
> what I expected: (exonEnd – exonS +1). It seems that exonS positions  
> (but not exonEnd ones) are (-1) shifted. Is this indeed the case?
>
> Many thanks in advance,
> Rani
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   
