Hello,
We do not populate the gene_id attribute in the GTF output with the 
refGene.name2 value because that value does not meet the definition of 
the field. A gene_id is intended to represent a specific genomic locus 
for a gene, with the associated transcript_ids being transcripts mapped 
to that same locus. This is not true for the RefSeq dataset. Some of 
this may be due to the independent mapping of RefSeqs to the genome by 
BLAT, but an examination of the RefSeq source data at NCBI also reveals 
some complications.

One example is the pair below, where the same refGene.name2 value (the 
"gene" field from the GenBank record) is assigned to two RefSeqs that 
do not map to the same location. It appears that the same gene 
symbol/name has been assigned to two distinct proteins.

PRG2 at chr11:56911410-56914706 
<http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr11:56911410-56914706&hgsid=132050741&refGene=full&hgFind.matches=NM_002728,>
 - (NM_002728) proteoglycan 2 preproprotein
PRG2 at chr19:763518-772952 
<http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr19:763518-772952&hgsid=132050741&refGene=full&hgFind.matches=NM_024888,>
 - (NM_024888) plasticity-related protein 2

Other examples include the following cases:
1) A RefSeq maps with exact or near-exact similarity to multiple 
genomic locations. This is true even if NR_ RefSeqs are excluded (some 
cover/represent known repeat regions) or when alignments to the 
"random" and "hap" chromosomes are discarded.
2) RefSeqs with different refGene.name2 values map to the same genomic 
location.
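
For anyone who wants to look for cases like PRG2 in a downloaded refGene 
table, here is a minimal sketch, not a tested tool. It assumes 
tab-separated rows with a leading bin column, so that chrom, txStart, 
txEnd and name2 sit at 0-based positions 2, 4, 5 and 12; please check 
those positions against the schema of the table you download. The 
function names are made up for this example, and strand is ignored.

```python
from collections import defaultdict


def merge_overlapping(intervals):
    """Merge (chrom, start, end) tuples that overlap on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        # Strict '<' treats intervals as half-open: touching ends do not overlap.
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def find_split_symbols(refgene_lines, chrom_col=2, start_col=4,
                       end_col=5, name2_col=12):
    """Return name2 values whose alignments fall into more than one locus.

    Column positions are assumptions about the refGene layout (see above).
    """
    loci = defaultdict(list)
    for line in refgene_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        loci[fields[name2_col]].append(
            (fields[chrom_col], int(fields[start_col]), int(fields[end_col]))
        )
    split_symbols = {}
    for symbol, intervals in loci.items():
        clusters = merge_overlapping(intervals)
        if len(clusters) > 1:
            split_symbols[symbol] = clusters
    return split_symbols
```

Because rows are grouped by name2, this flags both a shared symbol on 
two different RefSeqs (as with PRG2) and a single RefSeq that aligns to 
more than one place. It does not detect case 2 above.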

The GTF file format definition is explained here:
http://genome.ucsc.edu/FAQ/FAQformat#format4
Click through to the http://genes.cse.wustl.edu/GTF2.html link for more 
details and scroll down to find this:
> *[attributes]*
> All four features have the same two mandatory attributes at the end of 
> the record:
>
>    * /gene_id value;/     A globally unique identifier for the genomic
>      source of the transcript
>    * /transcript_id value;/     A globally unique identifier for the
>      predicted transcript.
>
> These attributes are designed for handling multiple transcripts from 
> the same genomic region. Any other attributes or comments must appear 
> after these two and will be ignored. 
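
To make the quoted rule concrete, the sketch below checks that the 
ninth column of a GTF line begins with gene_id followed by 
transcript_id. It is only an illustration: the function names are 
invented here, and the regular expression is a simplification of the 
attribute syntax rather than a full GTF parser.

```python
import re

# One attribute: a key, whitespace, a quoted or bare value, then ';'.
ATTR_RE = re.compile(r'\s*(\S+)\s+("[^"]*"|[^;"\s]+)\s*;')


def parse_attributes(column):
    """Return the attributes of a GTF attribute column as ordered (key, value) pairs."""
    return [(key, value.strip('"')) for key, value in ATTR_RE.findall(column)]


def has_mandatory_attributes(gtf_line):
    """True if the first two attributes are gene_id and transcript_id, in that order."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # not a complete GTF record
    keys = [key for key, _ in parse_attributes(fields[8])]
    return keys[:2] == ["gene_id", "transcript_id"]
```

Any attributes after the first two are returned by parse_attributes 
but do not affect the check.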
Some suggested workarounds to make the file more useful for your own 
research (if you do not need gene_id to meet this definition):
1) Download the regular refGene table along with the GTF export and use 
your own tools to replace the gene_id value with refGene.name2.
2) Add an additional, custom attribute to the GTF file for the 
refGene.name2 value. As noted above, not all tools will accept or read 
it, but it may be useful for you anyway.
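
Both workarounds can be sketched in a few lines of Python. This is an 
illustration under stated assumptions, not a supported tool: it assumes 
the refGene download is tab-separated with the accession (name) at 
0-based column 1 and name2 at column 12, and the function names and the 
"gene_name" attribute key are choices made for this example only.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]*)";')


def load_name2(refgene_lines, name_col=1, name2_col=12):
    """Map RefSeq accession -> name2 (column positions are assumptions)."""
    mapping = {}
    for line in refgene_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        mapping[fields[name_col]] = fields[name2_col]
    return mapping


def rewrite_gtf_line(line, name2_by_accession, replace_gene_id=False,
                     extra_key="gene_name"):
    """Apply workaround 1 (replace gene_id) or workaround 2 (append an attribute).

    Lines without a gene_id, or with an accession missing from the
    mapping, are returned unchanged (minus the trailing newline).
    """
    body = line.rstrip("\n")
    match = GENE_ID_RE.search(body)
    if match is None:
        return body
    name2 = name2_by_accession.get(match.group(1))
    if name2 is None:
        return body
    if replace_gene_id:
        # Workaround 1: swap the accession inside gene_id "..." for name2.
        return body[:match.start(1)] + name2 + body[match.end(1):]
    # Workaround 2: keep gene_id as exported and append a custom attribute
    # after the two mandatory ones.
    return body.rstrip() + ' ' + extra_key + ' "' + name2 + '";'
```

Keep in mind that after workaround 1 the gene_id no longer identifies a 
single locus, for the reasons given above, so tools that group 
transcripts by gene_id may merge unrelated records.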

We hope this helps,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Muller, Matthew wrote:
> Dear Helpdesk,
>
> I notice that the GTF export from the Human refGene table 
> (http://genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18&hgsid=131892923&hgta_doMainPage=1)
>  places the RefSeq ID into both the 'gene_id' and 'transcript_id' fields:
>
> chr4    hg18_refGene   ... gene_id "NM_001042402"; transcript_id 
> "NM_001042402";
> chr4    hg18_refGene   ... gene_id "NM_014435"; transcript_id "NM_014435";
>
> The gene_id and transcript_id fields are redundant.  However, I notice that 
> the ID column of the refGene table contains the HUGO name of the gene 
> associated with the RefSeq.  Should the gene_id field contain that value 
> instead?  This will make the RefSeq GTF export much more useful.
>
> Thanks for your help.
>
> Matthew Muller
> Life Technologies
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   
