Hello, I have downloaded the refGene table from the RefSeq Genes track (hg18) and found the following problem: for hundreds of protein-coding transcripts, the length of the coding region is not a multiple of three.
As one example, I checked transcript NM_000804. According to the NCBI nucleotide DB record for this transcript, the coding length is 738 (which is fine: 738 = 246*3), but the coding length calculated from the coordinates provided in refGene is 736. To understand where the difference comes from, I compared the exon lengths and found that the problem is in exon3: there is a difference of 2 nucleotides in that exon (see below).

Tx=NM_000804 (chr11)

NCBI nucleotide DB info
http://www.ncbi.nlm.nih.gov/nuccore/9257219
=============================================
exon1       1..44      Len=44
exon2       45..218    Len=174
exon3       219..407   Len=189
exon4       408..543   Len=136
exon5       544..847   Len=304
CDS         51..788    Len=738
polyA_site  847

RefSeq table downloaded from UCSC
=======================================
exon1 len=44,  exS=71524418, exE=71524462
exon2 len=174, exS=71524640, exE=71524814
exon3 len=187, exS=71527654, exE=71527841   <----- (len is 187 instead of 189)
exon4 len=136, exS=71528038, exE=71528174
exon5 len=304, exS=71528278, exE=71528582
5utrL=50, cdsL=736, 3utrL=59, mRNA_L=845
------------------------------------------------------

• Could you please check why, for many protein-coding transcripts, the length of the coding region is not a multiple of three?
• Another problem I encountered when calculating exon lengths: to get the correct length (according to the NCBI nucleotide DB), one has to calculate (exonEnd - exonS) rather than (exonEnd - exonS + 1), as I expected. It seems that the exonS positions (but not the exonEnd ones) are shifted by -1. Is this indeed the case?

Many thanks in advance,
Rani
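
For reference, the length arithmetic in this message can be sketched in Python. This is only a minimal sketch: it assumes the refGene exon starts are 0-based and the exon ends are 1-based (half-open intervals), which is exactly what the second question asks to have confirmed. The names `ucsc_exons`, `ncbi_exons`, `half_open_length` and `closed_length` are made up for illustration; all numbers are copied from the two listings for NM_000804.

```python
# Exon coordinates copied from the refGene (hg18) listing for NM_000804.
# Assumption: exS is 0-based and exE is 1-based (half-open interval).
ucsc_exons = [
    (71524418, 71524462),
    (71524640, 71524814),
    (71527654, 71527841),
    (71528038, 71528174),
    (71528278, 71528582),
]

# Exon ranges from the NCBI nucleotide record (1-based, both ends inclusive).
ncbi_exons = [(1, 44), (45, 218), (219, 407), (408, 543), (544, 847)]


def half_open_length(start, end):
    """Length of an interval with a 0-based start and a 1-based end."""
    return end - start


def closed_length(start, end):
    """Length of an interval whose start and end are both inclusive."""
    return end - start + 1


ucsc_lengths = [half_open_length(s, e) for s, e in ucsc_exons]
ncbi_lengths = [closed_length(s, e) for s, e in ncbi_exons]

# Exons whose lengths disagree: (exon number, NCBI length, UCSC length).
mismatches = [
    (i, n, u)
    for i, (n, u) in enumerate(zip(ncbi_lengths, ucsc_lengths), start=1)
    if n != u
]

# UTR lengths as reported in the message (5utrL and 3utrL).
utr5_len, utr3_len = 50, 59
mrna_len = sum(ucsc_lengths)
cds_len = mrna_len - utr5_len - utr3_len

print("UCSC exon lengths:", ucsc_lengths)   # [44, 174, 187, 136, 304]
print("NCBI exon lengths:", ncbi_lengths)   # [44, 174, 189, 136, 304]
print("Mismatching exons:", mismatches)     # [(3, 189, 187)]
print("mRNA length:", mrna_len)             # 845
print("CDS length:", cds_len, "remainder mod 3:", cds_len % 3)  # 736, 1
```

With `end - start`, four of the five exons agree with the NCBI record, so the only discrepancy left is the 2 bases in exon3; with `end - start + 1`, every exon would come out one base longer than in NCBI.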
