Hi Mike,

The IDs starting with NM_* are RefSeq IDs. These come directly from
GenBank. The format is like:

  nucleotide sequences: NM_XXXXXX.NN
  protein sequences:    NP_XXXXXX.NN

where the X's are a string of numbers and the N's are a version number.
Click through one of these in the Browser to see the GenBank data sheet
for these sequences at NCBI.

The RefSeq sequences are not exactly clustered by gene from NCBI,
although variants are noted by text descriptions here. Many groups
(including the UCSC Bioinformatics team) take in this data and do some
clustering.

The track in the UCSC Browser with this information is the UCSC Gene
track. It includes sequences from several sources, including the RefSeqs
from NCBI, arranged to create a comprehensive, non-redundant version of
the transcriptome/proteome. This will not be as complete as fly (since
fly is "complete"), but it is the best view to date. For this track, the
actual nucleotide transcript sequences are given a special unique
identifier, but this is mapped to the nucleotide and protein sources
(both those actually used and those rolled in when redundancy was
removed), and they are grouped into gene-bound clusters.

Open the UCSC Gene track and click on the description page to view how
the data was created. Also click on one of the data points to view all
of the associated data linked in.

Bring up the track in the Table browser to view the tables, schema,
linked tables, and content details:

  knownGene                 - alignment data per transcript
  knownCanonical            - groups transcripts into clusters
  kgXref                    - links in all associated IDs
  kgAlias                   - another ID linking table (RefSeqs included)
  refLink, knownToLocusLink - more linked data, including Locus link ID
  (many other tables linked in)

Examine the data and please let us know if you need more help,

Jennifer Jackson
UCSC Genome Bioinformatics Group

Duff wrote:
> I have been developing informatics scripts used primarily in our
> analysis of RNAseq data for Drosophila.
> One of the starting points for our analysis is a gene model specified
> by the UCSC Table browser in the form of a .BED file, which lists each
> isoform name (e.g. CG1674-RA, CG1674-RB, ...) along with each isoform's
> exon coordinates. The association between isoform and gene is
> straightforward from the isoform ID/name.
>
> Lately, I've been attempting to adapt the analysis scripts to Human
> expression data, and I'm encountering difficulty in locating, or
> piecing together, a similar gene model. I'm trying to work with the
> most up-to-date (Feb 2009) annotations, but the gene/isoform naming
> convention there seems quite different from that for fly. For example,
> NM_001145277, NM_001145278, and NM_018090 appear (judging from txStart
> & txEnd) to be different isoforms associated with a common gene, though
> there is nothing within the isoform names themselves to indicate a
> common gene (and using common txStart/txEnds to associate isoforms with
> common genes would seem, in general, to be incorrect).
>
> My question is: For Human Feb 2009 annotations, does there exist a
> table that translates from NM_* IDs to an ID scheme similar to that
> adopted for fly; i.e., a standard gene name followed by an isoform name
> sub-tag?
>
> Any suggestions you might have would be appreciated.
>
> -Mike

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
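
P.S. Following up on the ID format and the linking tables mentioned in
the reply above: a minimal Python sketch of how NM_* accessions might be
grouped under a gene name and given fly-style sub-tags (-RA, -RB, ...).
It assumes you have already exported two columns, RefSeq accession and
gene symbol, from one of the linking tables (for example kgXref) via the
Table browser; the exact column layout of your export is an assumption
here, so adjust accordingly. The function names and the "GENE_X" symbol
are hypothetical placeholders, not UCSC identifiers, and the -R<letter>
naming is only an imitation of the fly convention from the question.

```python
import re
from collections import defaultdict

# RefSeq mRNA/protein accession as described in the reply:
# NM_ or NP_ prefix, a run of digits, and an optional ".version".
REFSEQ_RE = re.compile(r"^(NM|NP)_(\d+)(?:\.(\d+))?$")


def parse_refseq(accession):
    """Split an accession into (prefix, number, version or None)."""
    match = REFSEQ_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an NM_/NP_ accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version) if version else None


def isoform_suffix(index):
    """0 -> 'A', 1 -> 'B', ..., 25 -> 'Z', 26 -> 'AA' (spreadsheet style)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def fly_style_names(pairs):
    """Map each NM_ accession to '<symbol>-R<letter>'.

    pairs: iterable of (refseq_accession, gene_symbol) rows, e.g. read
    from a two-column Table browser export (layout is an assumption).
    Versions are dropped and NP_ (protein) rows are skipped.
    """
    by_gene = defaultdict(set)
    for accession, symbol in pairs:
        prefix, number, _version = parse_refseq(accession)
        if prefix == "NM":
            by_gene[symbol].add(f"{prefix}_{number}")

    names = {}
    for symbol, accessions in by_gene.items():
        # Sort so the letter assignment is reproducible between runs.
        for index, accession in enumerate(sorted(accessions)):
            names[accession] = f"{symbol}-R{isoform_suffix(index)}"
    return names


if __name__ == "__main__":
    # The three accessions from the question; "GENE_X" is a placeholder
    # symbol, not the real gene name for these transcripts.
    rows = [
        ("NM_001145277", "GENE_X"),
        ("NM_001145278", "GENE_X"),
        ("NM_018090", "GENE_X"),
    ]
    for accession, name in sorted(fly_style_names(rows).items()):
        print(accession, name)
```

Note that the letters here are assigned simply by sorted accession
order, so they carry no biological meaning; if you need stable names
across annotation releases, you would want to store the mapping rather
than regenerate it.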
