Thank you, Jennifer. I had hoped to piece together an isoform/gene BED-style annotation model from the most recent hg19 (Feb 2009) annotations. The linked tables you mention all appear to be available only for the previous hg18 (March 2006) release, so I will go with that for now, maybe using clusterIDs as ersatz genes and the knownIsoforms table to map isoforms to "genes." Thanks again.
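For what it's worth, here is a minimal Python sketch of that mapping. It assumes a tab-separated knownIsoforms dump with two columns (cluster ID, then transcript ID) and a BED file whose fourth column holds the transcript ID. The function names and the toy IDs (txA, txB, txC) are hypothetical, made up for illustration only, and the column layout should be checked against the actual table schema before relying on it.

```python
import csv
import io
from collections import defaultdict


def load_cluster_map(handle):
    """Map transcript ID -> cluster ID.

    Assumes two tab-separated columns per row: clusterId, transcript.
    """
    transcript_to_cluster = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank, short, or comment rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        cluster_id, transcript = row[0], row[1]
        transcript_to_cluster[transcript] = cluster_id
    return transcript_to_cluster


def group_bed_by_cluster(bed_handle, transcript_to_cluster):
    """Group BED records by cluster ID, treating each cluster as a 'gene'.

    Assumes the isoform/transcript name is BED column 4.
    Records whose name is not in the mapping are skipped.
    """
    genes = defaultdict(list)
    for line in bed_handle:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        cluster = transcript_to_cluster.get(fields[3])
        if cluster is None:
            continue
        genes[cluster].append(fields)
    return dict(genes)


# Toy data with made-up IDs (not real annotations).
isoforms_txt = "1\ttxA\n1\ttxB\n2\ttxC\n"
bed_txt = (
    "chr1\t100\t500\ttxA\t0\t+\n"
    "chr1\t120\t480\ttxB\t0\t+\n"
    "chr2\t900\t990\ttxC\t0\t-\n"
)

mapping = load_cluster_map(io.StringIO(isoforms_txt))
genes = group_bed_by_cluster(io.StringIO(bed_txt), mapping)
for cluster, records in sorted(genes.items()):
    print(cluster, [record[3] for record in records])
```

With real files, the two StringIO objects would be replaced by open file handles. Transcripts missing from the mapping are dropped silently here; logging them instead may be preferable.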
-Mike

On Fri, Jun 19, 2009 at 1:53 PM, Jennifer Jackson <[email protected]> wrote:

> Hi Mike,
>
> The IDs starting with NM_* are RefSeq IDs. These come directly from
> GenBank. The format is like:
>
> nucleotide sequences: NM_XXXXXX.NN
> protein sequences: NP_XXXXXX.NN
>
> Where the X's are a string of numbers and the N's are a version number.
> Click through one of these in the Browser to see the GenBank data sheet for
> these sequences at NCBI. The RefSeq sequences are not exactly clustered by
> gene from NCBI, although variants are noted by text descriptions here. Many
> groups (including the UCSC Bioinformatics team) take in this data and do
> some clustering.
>
> The track in the UCSC Browser with this information is the UCSC Gene track.
> It includes sequences from several sources, including the RefSeqs from NCBI,
> arranged to create a comprehensive, non-redundant version of the
> transcriptome/proteome. This will not be as complete as fly (since it is
> "complete") but it is the best view to date. For this track, the actual
> nucleotide transcript sequences are given a special unique identifier, but
> this is mapped to the nucleotide and protein sources (both the actual used
> and those rolled in when redundancy was removed) and they are grouped into
> gene-bound clusters.
>
> Open the UCSC Gene track and click on the description page to view how the
> data was created. Also click on one of the data points to view all of the
> associated data linked in. Bring up the track in the Table Browser to view
> the tables, schema, linked tables, and content details.
>
> knownGene - alignment data per transcript
> knownCanonical - groups transcripts into clusters
> kgXref - links in all associated IDs
> kgAlias - another ID linking table (RefSeqs included)
> refLink, knownToLocusLink - more linked data, including Locus link ID
> (many other tables linked in)
>
> Examine the data and please let us know if you need more help,
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
>
> Duff wrote:
>
>> I have been developing informatics scripts used primarily in our analysis
>> of RNAseq data for Drosophila. One of the starting points for our analysis
>> is a gene model specified by the UCSC Table Browser in the form of a .BED
>> file, which lists each isoform name (e.g. CG1674-RA, CG1674-RB, ...) along
>> with each isoform's exon coordinates. The association between isoform and
>> gene is straightforward from the isoform ID/name.
>>
>> Lately, I've been attempting to adapt the analysis scripts to Human
>> expression data, and I'm encountering difficulty in locating, or piecing
>> together, a similar gene model. I'm trying to work with the most up-to-date
>> (Feb 2009) annotations, but the gene/isoform naming convention there seems
>> quite different from that for fly. For example, NM_001145277, NM_001145278,
>> and NM_018090 appear (judging from txStart & txEnd) to be different isoforms
>> associated with a common gene, though there is nothing within the isoform
>> names themselves to indicate a common gene (and using common txStart/txEnd
>> values to associate isoforms with common genes would seem, in general, to
>> be incorrect).
>>
>> My question is: For Human Feb 2009 annotations, does there exist a table
>> that translates from NM_* IDs to an ID scheme similar to that adopted for
>> fly; i.e., a standard gene name followed by an isoform name sub-tag?
>>
>> Any suggestions you might have would be appreciated.
>>
>> -Mike
>>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
