Hi,

Exporting the hg19 RefSeq gene models from the Table Browser (Genes+Prediction
group; RefSeq Genes track; table: refGene; output format: GTF) returns, for
example:

chrX    hg19_refGene    start_codon    76709647    76709649    0.000000    +    .    gene_id "NM_003868"; transcript_id "NM_003868";
chrX    hg19_refGene    CDS            76709647    76709751    0.000000    +    0    gene_id "NM_003868"; transcript_id "NM_003868";
chrX    hg19_refGene    exon           76709647    76709751    0.000000    +    .    gene_id "NM_003868"; transcript_id "NM_003868";
chrX    hg19_refGene    CDS            76711768    76712010    0.000000    +    0    gene_id "NM_003868"; transcript_id "NM_003868";
chrX    hg19_refGene    stop_codon     76712011    76712013    0.000000    +    .    gene_id "NM_003868"; transcript_id "NM_003868";
chrX    hg19_refGene    exon           76711768    76712013    0.000000    +    .    gene_id "NM_003868"; transcript_id "NM_003868";
This output incorrectly indicates that the start codon is the first three
bases of the first aligned CDS exon.

In fact, in cases like this, the first exon is not aligned to hg19, so the
'first' CDS exon that appears in the hg19 alignment is actually midway
through the coding sequence:

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=240218517&g=htcCdnaAli&i=NM_003868&c=chrX&l=76709054&r=76712605&o=76709646&aliTable=refSeqAli&table=refGene
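
To make "not aligned" concrete, here is a rough Python sketch of the check I
have in mind. It assumes the alignment is available as PSL fields (strand,
qSize, blockSizes, qStarts, as in refSeqAli) and that the CDS start within the
mRNA is known from some other source. The function names and the cds_start
argument are my own, purely for illustration.

```python
def aligned_query_blocks(strand, q_size, block_sizes, q_starts):
    """Return sorted 0-based, half-open mRNA intervals covered by PSL blocks."""
    intervals = []
    for size, start in zip(block_sizes, q_starts):
        if strand == "-":
            # PSL stores minus-strand qStarts relative to the
            # reverse-complemented query, so flip them back.
            start = q_size - (start + size)
        intervals.append((start, start + size))
    return sorted(intervals)


def start_codon_is_aligned(cds_start, intervals):
    """True only if all three start-codon bases fall inside aligned blocks.

    cds_start is the 0-based position of the first coding base in the mRNA
    (an assumed input; it is not part of the PSL record).
    """
    return all(
        any(lo <= pos < hi for lo, hi in intervals)
        for pos in range(cds_start, cds_start + 3)
    )
```

With made-up numbers: if the aligned blocks cover mRNA positions 400-750 and
the annotated CDS starts at position 100, start_codon_is_aligned returns False,
which is the situation I believe applies to NM_003868.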

Why are such partial coding alignments included in gene models?

If they are intentionally included, it seems that, at a minimum, the
'start_codon' entry in the gene model should be removed, to avoid inaccurate
inferences based on the assumption that the start codon is actually at that
location.
Is there a way to determine which refGene alignments do not have an aligned
CDS start in the reference genome?
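
One possibility I can think of, sketched below in Python, is to filter on the
cdsStartStat and cdsEndStat columns of the refGene table (values none, unk,
incmpl, cmpl). This is only a guess: I am assuming those fields refer to the
genomic (low and high coordinate) ends of the CDS, and I have not verified
that they are actually set to something other than cmpl for cases like
NM_003868. The function names are hypothetical, and the input is assumed to be
the tab-separated "all fields" output of the table.

```python
# Column order of the genePred-extended layout used by the refGene table.
REFGENE_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]


def parse_refgene_line(line):
    """Split one tab-separated refGene row into a dict keyed by column name."""
    return dict(zip(REFGENE_COLUMNS, line.rstrip("\n").split("\t")))


def translation_start_status(row):
    """Return the status of the CDS end where translation begins.

    Assumption: cdsStartStat describes the low-coordinate end, so on the
    minus strand the translation start is described by cdsEndStat.
    """
    return row["cdsStartStat"] if row["strand"] == "+" else row["cdsEndStat"]


def transcripts_without_complete_start(lines):
    """List transcript names whose translation start is not marked 'cmpl'."""
    flagged = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        row = parse_refgene_line(line)
        if translation_start_status(row) != "cmpl":
            flagged.append(row["name"])
    return flagged
```

If those status fields are not reliable for this purpose, is there another
table or field that would be?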

Dan
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
