Hi Dan,

Thanks for the feedback. One of our engineers took a look and found that 
including the "start_codon" entry in the GTF output for this transcript 
is a bug in the Table Browser. We've logged the issue.

The 5' end of this gene is missing from the reference assembly. The GRC 
is aware of this problem; see GRC Incident HG-146. We do include 
truncated gene models (as do Ensembl and GENCODE).

To find out exactly what aligned, you can use the refSeqAli table (which 
is in PSL format, http://genome.ucsc.edu/FAQ/FAQformat.html#format2).
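
As a rough sketch of how that check could look (not an official UCSC 
tool), the Python below reads refSeqAli rows saved as tab-separated PSL 
and reports transcripts whose first aligned query base is not base 0, 
meaning some 5' sequence did not align. The function names and example 
rows are hypothetical, and it assumes the Table Browser output may carry 
a leading "bin" column before the 21 standard PSL columns. A nonzero 
value only flags a candidate: if the missing bases are all 5' UTR, the 
CDS start may still be aligned, so you would need the CDS position 
within the transcript to confirm.

```python
import io

PSL_COLUMNS = 21  # number of columns in a standard PSL row


def parse_psl_line(line):
    """Parse one tab-separated PSL row into a small dict.

    Assumption: a row with 22 fields has a leading 'bin' column,
    as in UCSC database tables, and that column is dropped.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == PSL_COLUMNS + 1:
        fields = fields[1:]
    if len(fields) != PSL_COLUMNS:
        raise ValueError("expected 21 PSL columns, got %d" % len(fields))
    return {
        "strand": fields[8],
        "qName": fields[9],         # transcript accession
        "qSize": int(fields[10]),   # transcript length
        "qStart": int(fields[11]),  # first aligned transcript base (0-based)
        "qEnd": int(fields[12]),    # end of aligned region in transcript
        "tName": fields[13],        # chromosome
    }


def unaligned_ends(rec):
    """Return (unaligned 5' bases, unaligned 3' bases) of the transcript.

    qStart and qEnd are read as forward-strand transcript coordinates,
    so qStart counts the bases missing from the transcript's 5' end.
    """
    return rec["qStart"], rec["qSize"] - rec["qEnd"]


def find_five_prime_truncated(handle, min_missing=1):
    """Collect (accession, chrom, missing 5' bases) for truncated rows."""
    hits = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the '#'-prefixed header
        rec = parse_psl_line(line)
        missing5, _missing3 = unaligned_ends(rec)
        if missing5 >= min_missing:
            hits.append((rec["qName"], rec["tName"], missing5))
    return hits


if __name__ == "__main__":
    # Made-up rows for illustration only; not real refSeqAli data.
    truncated = ["100", "0", "0", "0", "0", "0", "0", "0", "+",
                 "NM_EXAMPLE1", "500", "400", "500",
                 "chrX", "2000", "1000", "1100", "1", "100,", "400,", "1000,"]
    complete = ["585", "500", "0", "0", "0", "0", "0", "0", "0", "-",
                "NM_EXAMPLE2", "500", "0", "500",
                "chrX", "2000", "100", "600", "1", "500,", "0,", "100,"]
    text = "\t".join(truncated) + "\n" + "\t".join(complete) + "\n"
    for name, chrom, missing in find_five_prime_truncated(io.StringIO(text)):
        print(name, chrom, missing)
```

To run it on real data, pass an open file handle for your saved 
refSeqAli output instead of the in-memory example.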


Please let us know if you have any additional questions: [email protected]

-
Greg Roe
UCSC Genome Bioinformatics Group



On 1/30/12 2:29 AM, Dan Richards wrote:
> Hi,
>
> Using hg19 RefSeq gene model (from Table Browser, Genes+Prediction group;
> RefSeq Genes track; table: refGene; output format: GTF) returns for example:
>
> chrX    hg19_refGene    start_codon     76709647        76709649        0.000000        +       .       gene_id "NM_003868"; transcript_id "NM_003868";
> chrX    hg19_refGene    CDS     76709647        76709751        0.000000        +       0       gene_id "NM_003868"; transcript_id "NM_003868";
> chrX    hg19_refGene    exon    76709647        76709751        0.000000        +       .       gene_id "NM_003868"; transcript_id "NM_003868";
> chrX    hg19_refGene    CDS     76711768        76712010        0.000000        +       0       gene_id "NM_003868"; transcript_id "NM_003868";
> chrX    hg19_refGene    stop_codon      76712011        76712013        0.000000        +       .       gene_id "NM_003868"; transcript_id "NM_003868";
> chrX    hg19_refGene    exon    76711768        76712013        0.000000        +       .       gene_id "NM_003868"; transcript_id "NM_003868";
> which incorrectly indicates that the start codon is the first three bases
> of the first aligned CDS exon.
>
> In fact, in cases like this, the first exon is not aligned to hg19, so the
> 'first' CDS exon that appears in the hg19 alignment is actually midway
> through the coding sequence:
>
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=240218517&g=htcCdnaAli&i=NM_003868&c=chrX&l=76709054&r=76712605&o=76709646&aliTable=refSeqAli&table=refGene
>
> Why are such partial coding alignments included in gene models?
>
> If they are intentionally included, it seems minimally the 'start_codon'
> entry in the gene model should be removed to avoid inaccurate inferences
> based on the assumption that the start codon is actually at that location.
> Is there a way to determine which refGene alignments do not have an aligned
> CDS start in the reference genome?
>
> Dan
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome