Hi Anton,

The process you describe is correct. To see why you are getting more than 1747 unique gene symbols, I tried a few things on a list of all the gene symbols from the kgXref table, which I retrieved from the Table Browser and saved to a file called geneSymbols. I then used some Linux utilities to count the unique gene symbols in the file. I found that if I don't sort the file first, I get more "unique" items, because uniq only collapses duplicate lines that are adjacent. For example:

$ cat geneSymbols | uniq | wc -l
29323

whereas:

$ cat geneSymbols | sort | uniq | wc -l
29289

Are you using these utilities? If so, please make sure you sort the file first.

Also, I wanted to let you know that an updated version of the mm9 UCSC Genes track is currently under active quality assurance review, and we hope to release it in the coming months. You can preview it at genome-preview.ucsc.edu, but please be extremely cautious when using this and any other unreleased data on the preview machine, since it has not yet completed our quality assurance process. I mention it because this updated version finds a match for more of your gene symbols than the current UCSC Genes.

Additionally, the RefSeq Genes track is kept up to date by an automatic process, so using the refFlat table (I don't believe you'll need fields from any additional tables) might also yield matches for a higher number of your gene symbols.

I hope this information is helpful. Please contact the mail list ([email protected]) again if you have any further questions.

Katrina Learned
UCSC Genome Bioinformatics Group

Anton Kratz wrote, On 04/14/11 22:08:
> Dear UCSC team,
>
> I have a list of 2343 unique Gene Symbols for mouse.
>
> What I want are two lists:
>
> One list with at least one entry of chromosome, strand, start, end for each
> of the 2343 Gene Symbols on mm9.
>
> Another list with *all* Gene Symbols for mouse, together with again at least
> one coordinate information consisting of chromosome, strand, start, end on
> mm9.
>
> What is the easiest way to retrieve that?
>
> I thought I know how to do that using the Table Browser but I run into
> inconsistencies which make me unsure if I'm doing it correctly:
>
> What I tried: I used the Table Browser with the kgXref table for mouse.
> I upload my list of 2343 unique Gene Symbols for mouse.
> Then I get a box:
> *Error(s):*
>
> - Note: 596 of the 2343 given identifiers (e.g. Tm4sf12) have no match in
> table kgXref, field kgID or in alias table kgAlias, field alias. Try the
> "describe table schema" button for more information about the table and
> field.
>
> So I would expect 1747 Gene Symbols to be left (this seems to be wrong, see
> below).
>
> I select "selected fields from primary and related tables" and press "get
> output".
>
> On the new page, under "Linked Tables", I check "mm9 knownGene Genes based on
> RefSeq, GenBank, and UniProt." and press "Allow selection from checked
> tables". I do this because I don't see a way to directly retrieve the
> coordinates of the Gene Symbols, so I try to use the known genes as a kind
> of intermediate (is there a direct way?).
>
> Under "Select Fields from mm9.kgXref" I check geneSymbol, and under
> "mm9.knownGene fields" I check name, chrom, strand, txStart, txEnd. Now I
> press "Get Output".
>
> The result is a file with 3866 entries, 1809 of them unique. I wonder why
> 1809 and not 1747?!
>
> Thanks,
> Anton
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
