Hello,
We do not use the refGene.name2 field as the gene_id in GTF output
because its values do not meet the definition of that field. A gene_id
is meant to identify a single genomic locus for a gene, with every
transcript_id that carries it mapped to that same locus. This does not
hold for the RefSeq dataset. Part of the discrepancy may come from the
independent mapping of RefSeqs to the genome by BLAT, but an examination
of the RefSeq source data at NCBI also reveals some complications.
One example is the pair below, where the same refGene.name2 value
(the "gene" field from the GenBank record) is assigned to two RefSeqs
that do not map to the same location. It appears that the same gene
symbol/name has been assigned to two distinct proteins.
PRG2 at chr11:56911410-56914706
<http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr11:56911410-56914706&hgsid=132050741&refGene=full&hgFind.matches=NM_002728,>
- (NM_002728) proteoglycan 2 preproprotein
PRG2 at chr19:763518-772952
<http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr19:763518-772952&hgsid=132050741&refGene=full&hgFind.matches=NM_024888,>
- (NM_024888) plasticity-related protein 2
Other examples include the following cases:
1) A RefSeq maps with exact or near-exact similarity to multiple
genomic locations. This is true even when NR_ RefSeqs (some of which
cover/represent known repeat regions) are excluded and when alignments
to the "random" and "hap" chromosomes are discarded.
2) RefSeqs with different refGene.name2 values map to the same genomic
location.
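If you would like to look for such cases in your own copy of the table,
the two situations above can be sketched roughly as follows. This is
only an illustration, not how the Genome Browser builds its exports. It
assumes a tab-separated refGene dump with the accession in column 2,
chrom in column 3, txStart/txEnd in columns 5 and 6, and name2 in
column 13 (please check the table schema for your assembly). The
function names are made up for this example, and strand is ignored for
simplicity.

```python
from collections import defaultdict

# Assumed 0-based column positions in a tab-separated refGene dump;
# verify them against the table schema for your assembly.
NAME, CHROM, TX_START, TX_END, NAME2 = 1, 2, 4, 5, 12


def read_refgene(lines):
    """Yield (accession, chrom, start, end, name2) from refGene rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield f[NAME], f[CHROM], int(f[TX_START]), int(f[TX_END]), f[NAME2]


def symbols_on_multiple_chroms(rows):
    """Return {name2: sorted chroms} for symbols seen on more than one chromosome."""
    chroms = defaultdict(set)
    for _acc, chrom, _start, _end, name2 in rows:
        chroms[name2].add(chrom)
    return {sym: sorted(c) for sym, c in chroms.items() if len(c) > 1}


def loci_with_multiple_symbols(rows):
    """Return (chrom, sorted name2 list) for overlapping transcripts whose name2 differs."""
    by_chrom = defaultdict(list)
    for _acc, chrom, start, end, name2 in rows:
        by_chrom[chrom].append((start, end, name2))
    clusters = []
    for chrom, spans in by_chrom.items():
        spans.sort()
        cur_end, cur_syms = -1, set()
        for start, end, name2 in spans:
            if start >= cur_end:
                # No overlap with the current cluster: report it if mixed, then reset.
                if len(cur_syms) > 1:
                    clusters.append((chrom, sorted(cur_syms)))
                cur_end, cur_syms = end, {name2}
            else:
                # Overlap: extend the cluster and remember this symbol.
                cur_end = max(cur_end, end)
                cur_syms.add(name2)
        if len(cur_syms) > 1:
            clusters.append((chrom, sorted(cur_syms)))
    return clusters
```

Both reporting functions iterate over the rows once each, so read the
file into a list first, e.g. rows = list(read_refgene(open("refGene.txt"))).
A symbol on two chromosomes is only the most obvious form of case 1;
multiple mappings on the same chromosome would need a distance check
that this sketch does not do.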
The GTF file format definition is explained here:
http://genome.ucsc.edu/FAQ/FAQformat#format4
For more details, click through to the
http://genes.cse.wustl.edu/GTF2.html link and scroll down to find this:
> *[attributes]*
> All four features have the same two mandatory attributes at the end of
> the record:
>
> * /gene_id value;/ A globally unique identifier for the genomic
> source of the transcript
> * /transcript_id value;/ A globally unique identifier for the
> predicted transcript.
>
> These attributes are designed for handling multiple transcripts from
> the same genomic region. Any other attributes or comments must appear
> after these two and will be ignored.
Some suggested workarounds to make the file more useful for your own
research (if strict adherence to the gene_id definition is not
required):
1) Download the regular refGene table along with the GTF export and use
your own tools to replace the gene_id value with refGene.name2.
2) Add a custom attribute to the GTF file for the refGene.name2 value.
As the definition above notes, attributes after the two mandatory ones
are ignored, so not all tools will read it, but it may still be useful
for you.
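Either workaround could be scripted along the following lines. This is
a minimal sketch under a few assumptions, not a supported tool: it
assumes the refGene dump is tab-separated with the accession in column
2 and name2 in column 13, that the GTF transcript_id is the plain
RefSeq accession, and it uses "gene_name" as the custom attribute name,
which is just a choice made for this example. The function names are
likewise hypothetical.

```python
import re

# Assumed 0-based columns in a tab-separated refGene dump (accession, name2).
NAME, NAME2 = 1, 12

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GENE_ID = re.compile(r'gene_id "[^"]*"')


def load_name2(refgene_lines):
    """Map RefSeq accession -> name2 from refGene rows."""
    mapping = {}
    for line in refgene_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) > NAME2:
            mapping[f[NAME]] = f[NAME2]
    return mapping


def annotate_gtf(gtf_lines, name2_by_acc, replace_gene_id=False):
    """Yield GTF lines with name2 added.

    replace_gene_id=True  -> workaround 1: overwrite gene_id with name2.
    replace_gene_id=False -> workaround 2: append a custom attribute,
                             here called gene_name (an arbitrary choice).
    Lines whose transcript_id has no name2 entry pass through unchanged.
    """
    for line in gtf_lines:
        body = line.rstrip("\n")
        fields = body.split("\t")
        if body.startswith("#") or len(fields) < 9:
            yield line
            continue
        attrs = fields[8]
        m = TRANSCRIPT_ID.search(attrs)
        name2 = name2_by_acc.get(m.group(1)) if m else None
        if name2 is None:
            yield line
            continue
        if replace_gene_id:
            # Replace only the first gene_id attribute; a function is used as
            # the replacement so name2 is inserted literally.
            attrs = GENE_ID.sub(lambda _m: 'gene_id "%s"' % name2, attrs, count=1)
        else:
            # Keep the two mandatory attributes first and append the new one.
            attrs = attrs.rstrip()
            if not attrs.endswith(";"):
                attrs += ";"
            attrs += ' gene_name "%s";' % name2
        fields[8] = attrs
        yield "\t".join(fields) + "\n"
```

Keep in mind that with replace_gene_id=True the resulting gene_id no
longer identifies a single locus (the PRG2 pair above would share one
gene_id), so downstream tools that group transcripts by gene_id may
merge unrelated records.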
We hope this helps,
Jennifer Jackson
UCSC Genome Bioinformatics Group
Muller, Matthew wrote:
> Dear Helpdesk,
>
> I notice that the GTF exports from the Human refGene table export
> (http://genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18&hgsid=131892923&hgta_doMainPage=1)
> places the RefSeq ID into both the 'gene_id' and 'transcript_id' fields:
>
> chr4 hg18_refGene ... gene_id "NM_001042402"; transcript_id
> "NM_001042402";
> chr4 hg18_refGene ... gene_id "NM_014435"; transcript_id "NM_014435";
>
> The gene_id and transcript_id fields are redundant. However, I notice that
> the ID column of the refGene table contains the HUGO name of the gene
> associated with the RefSeq. Should the gene_id field contain that value
> instead? This will make the RefSeq GTF export much more useful.
>
> Thanks for your help.
>
> Matthew Muller
> Life Technologies
>
>
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>