Re: [Genome] Human (Feb 2009) gene models: gene/isoform naming convention

Jennifer Jackson Fri, 19 Jun 2009 11:55:11 -0700

Hi Mike,

The IDs starting with NM_* are RefSeq IDs. These come directly from 
GenBank. The format is:

nucleotide sequences: NM_XXXXXX.NN
protein sequences: NP_XXXXXX.NN

Where the X's are a string of digits and the N's are a version number. 
Click through one of these in the Browser to see the GenBank data sheet 
for the sequence at NCBI. NCBI does not strictly cluster the RefSeq 
sequences by gene, although variants are noted in the text descriptions 
there. Many groups (including the UCSC Bioinformatics team) take in this 
data and do some clustering.
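
As a rough illustration of the ID format described above, the following 
Python sketch splits an accession into prefix, number, and version. The 
function name parse_refseq and the regular expression are assumptions 
made for this example (they only cover the NM_/NP_ prefixes mentioned 
here and treat the version as optional); this is not an official parser.

```python
import re

# Assumed pattern: NM or NP prefix, underscore, digits, optional ".version".
REFSEQ_RE = re.compile(
    r"^(?P<prefix>NM|NP)_(?P<number>\d+)(?:\.(?P<version>\d+))?$"
)

def parse_refseq(accession):
    """Return (prefix, number, version) or None if the text does not match.

    The number is kept as a string so leading zeros are preserved;
    version is an int, or None when the accession has no ".NN" suffix.
    """
    match = REFSEQ_RE.match(accession.strip())
    if match is None:
        return None
    version = match.group("version")
    return (
        match.group("prefix"),
        match.group("number"),
        int(version) if version is not None else None,
    )
```

For example, parse_refseq("NM_018090") gives the prefix "NM" and the 
number "018090" with no version, while a fly-style name such as 
"CG1674-RA" is rejected.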

The track in the UCSC Browser with this information is the UCSC Genes 
track. It includes sequences from several sources, including the RefSeqs 
from NCBI, arranged to create a comprehensive, non-redundant version of 
the transcriptome/proteome. The human set will not be as complete as the 
fly set (which is considered "complete"), but it is the best view to 
date. In this track, each nucleotide transcript sequence is given its 
own unique identifier, which is mapped to the nucleotide and protein 
sources (both those actually used and those rolled in when redundancy 
was removed), and the transcripts are grouped into gene-bound clusters.

Open the UCSC Genes track and click on the description page to view how 
the data was created. Also click on one of the data points to view all 
of the associated data linked in. Bring up the track in the Table 
Browser to view the tables, schema, linked tables, and content details.

knownGene - alignment data per transcript
knownCanonical - groups transcripts into clusters
kgXref - links in all associated IDs
kgAlias - another ID linking table (RefSeqs included)
refLink, knownToLocusLink - more linked data, including LocusLink ID
(many other tables linked in)
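
One way to use these tables for the question asked is sketched below in 
Python: read a tab-separated dump of kgXref from the Table Browser, group 
the RefSeq IDs by gene symbol, and give each one a "SYMBOL-n" label. 
Several things here are assumptions rather than anything UCSC provides: 
the function name isoform_names, the column names geneSymbol and refseq 
(check them against the schema page), a header line without a leading 
"#", and the "-n" numbering itself, which is only an illustrative 
stand-in for the fly-style sub-tags.

```python
import csv
import io
from collections import defaultdict

def isoform_names(tsv_text, gene_col="geneSymbol", refseq_col="refseq"):
    """Map each RefSeq ID to a 'SYMBOL-n' label, grouped by gene symbol.

    tsv_text is the text of a tab-separated table whose first line is a
    header containing gene_col and refseq_col (assumed column names).
    Rows with an empty gene symbol or empty RefSeq ID are skipped.
    """
    by_gene = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene, accession = row[gene_col], row[refseq_col]
        if gene and accession:
            by_gene[gene].add(accession)

    names = {}
    for gene, accessions in by_gene.items():
        # Sort so the numbering is stable from run to run.
        for index, accession in enumerate(sorted(accessions), start=1):
            names[accession] = f"{gene}-{index}"
    return names
```

The resulting dictionary can then be used to relabel the name column of 
a BED file exported from knownGene or the RefSeq track. If clusters 
rather than symbols are wanted, knownCanonical would be the table to 
group by instead.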

Examine the data and please let us know if you need more help,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Duff wrote:
> I have been developing informatics scripts used primarily in our analysis of
> RNAseq data for Drosophila. One of the starting points for our analysis is a
> gene model specified by the UCSC Table browser
> in the form of a .BED file, which lists each isoform name (e.g. CG1674-RA,
> CG1674-RB,...) along with each isoform's exon coordinates. The association
> between isoform and gene is straightforward from the isoformID/name.
>
> Lately, I've been attempting to adapt the analysis scripts to Human expression
> data, and I'm encountering difficulty in locating, or piecing together, a
> similar gene model. I'm trying to work with the most up-to-date (Feb 2009)
> annotations, but the gene/isoform naming convention there seems quite
> different from that for fly. For example NM_001145277, NM_001145278, and
> NM_018090 appear (judging from txStart & txEnd) to be different isoforms
> associated with a common gene, though there is nothing within the isoform
> names themselves to indicate a common gene (and using common txStart/Ends to
> associate isoforms with common genes would seem, in general, to be incorrect).
>
> My question is: For Human Feb 2009 annotations, does there exist a table that
> translates from NM_* IDs to an ID-scheme similar to that adopted for fly;
> i.e., a standard gene name followed by an isoform name sub-tag?
>
> Any suggestions you might have would be appreciated.
>
>
>
> -Mike
>
>
>   
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
