Hello Rani,

Looking at the NCBI records for your gene (see the very end):
http://www.ncbi.nlm.nih.gov/nuccore/9257219?from=370&to=371&report=gbwithparts
I see 2 variants (NM_000804 and NM_000804.2), which seem to differ in the
exon you found. That would explain the difference you're seeing.

Generally speaking, the refGene table is composed of gene annotations
produced by aligning the RefSeq mRNAs/RNAs to the genome. Genome
annotations contain only genomic coordinates, so for annotations created
from an alignment, insertions/deletions in the query sequence are lost.
The pairwise alignments, however, preserve this information. So to really
understand the base-level differences between the mRNA and the genome,
you might want to look at the alignments, which are in the refSeqAli
table.

(Also note: gaps of <= 8 bases are closed in an attempt to create a valid
gene model. The gap closing is not optimal.)

Best regards,

Pauline Fujita
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On 9/29/10 3:19 AM, [email protected] wrote:
> Hello,
>
> I have downloaded the refGene table from the RefSeq Genes track (hg18)
> and found the following problem: for hundreds of protein-coding
> transcripts, the length of the coding region is not a whole
> multiple of three.
>
> As one example, I checked transcript NM_000804. According to the NCBI
> nucleotide DB record for this transcript, the coding length is 738
> (which is fine: 738=246*3); but calculating the coding region length
> according to the coordinates provided in refGene, the length is 736.
>
> To understand where the difference comes from, I compared exons’
> lengths and found that the problem is in exon3: there is a difference
> of 2 nucleotides in that exon – see below.
>
> Tx=NM_000804, (chr11)
>
> NCBI nucleotide DB info
> http://www.ncbi.nlm.nih.gov/nuccore/9257219
> =============================================
> exon1 1..44    Len=44
> exon2 45..218  Len=174
> exon3 219..407 Len=189
> exon4 408..543 Len=136
> exon5 544..847 Len=304
>
> CDS 51..788 Len=738
> polyA_site 847
>
>
> RefSeq table downloaded from UCSC
> =======================================
> exon1 len=44,  exS=71524418, exE=71524462
> exon2 len=174, exS=71524640, exE=71524814
> exon3 len=187, exS=71527654, exE=71527841 <----- (len is 187 instead of 189)
> exon4 len=136, exS=71528038, exE=71528174
> exon5 len=304, exS=71528278, exE=71528582
>
> 5utrL=50, cdsL=736, 3utrL=59, mRNA_L=845
> ------------------------------------------------------
>
> • Could you please check why, for many protein-coding transcripts, the
> length of the coding region is not a whole multiple of three?
>
> • Another problem that I encountered when calculating exons’ lengths
> was that in order to get the correct length (according to the NCBI
> nucleotide DB), one has to calculate (exonEnd - exonS) rather than
> what I expected: (exonEnd - exonS + 1). It seems that exonS positions
> (but not exonEnd ones) are shifted by -1. Is this indeed the case?
>
> Many thanks in advance,
> Rani
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
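
The length arithmetic in the quoted message can be reproduced with a short Python sketch. It assumes the usual UCSC table convention of zero-based, half-open coordinates (start is 0-based, end is 1-based), which is why exonEnd - exonS gives the length without adding 1, while NCBI feature ranges such as 51..788 are 1-based and inclusive. The function names are hypothetical, and the cds_start and cds_end values are not read from the table: they are derived here from the 5utrL=50 and 3utrL=59 figures in the message, assuming the transcript is on the plus strand.

```python
# Exon coordinates (exS, exE) as quoted in the message (refGene, hg18, chr11).
EXONS = [
    (71524418, 71524462),
    (71524640, 71524814),
    (71527654, 71527841),
    (71528038, 71528174),
    (71528278, 71528582),
]

# Assumption: derived from 5utrL=50 and 3utrL=59 on the plus strand,
# not taken from the actual cdsStart/cdsEnd columns of the table.
CDS_START = 71524640 + (50 - 44)   # 6 bases into exon2
CDS_END = 71528582 - 59            # 59 bases before the end of exon5


def exon_lengths(exons):
    """Lengths for 0-based, half-open intervals: end - start, no +1."""
    return [end - start for start, end in exons]


def cds_length(exons, cds_start, cds_end):
    """Sum the overlap of each exon with the coding interval."""
    total = 0
    for start, end in exons:
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return total


def one_based_length(first, last):
    """Length of a 1-based, inclusive range such as NCBI's 51..788."""
    return last - first + 1


if __name__ == "__main__":
    print("exon lengths:", exon_lengths(EXONS))
    cds = cds_length(EXONS, CDS_START, CDS_END)
    print("refGene CDS length:", cds, "remainder mod 3:", cds % 3)
    print("NCBI CDS length:", one_based_length(51, 788))
```

With the coordinates above, the genomic CDS comes out 2 bases shorter than the NCBI CDS, all of it in exon3, which is consistent with a small insertion in the mRNA relative to the genome being lost in the genomic annotation.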
