Hello Maria,

UCSC Genes (as currently constructed) is expected to contain more genes
than RefSeq Genes.
The UCSC Genes track includes all of RefSeq Genes (as of the time of the
track build) plus CCDS and UniProt (with confirming information from
GenBank). See the UCSC Genes track description for full details on
content and methods.

I am not clear on exactly how you are counting genes, but to be
consistent with the stated track methods, count each set as follows:

  unique refGene.name2 = number of genes in RefSeq Genes
  unique knownCanonical.clusterId = number of genes in UCSC Genes

Using knownGene.protein_ID can be problematic, as some records have no
data in that field. Link in information from kgXref and/or kgAlias to
get various gene handles for each transcript, including those with no
knownGene.protein_ID data.

Please note that UCSC Genes is clustered not by gene name but by
alignment information (described in the track description), so you may
find mixed gene name/symbol data within a "cluster". Which name is
correct is for you to determine; it is a judgment call best made by a
person. How you prefer to cluster will likely depend on what you are
doing with the data downstream.

Best of luck with your project,
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/6/10 11:41 AM, Maria Poptsova wrote:
> Hi Jennifer,
>
> I have one question about number of unique genes (no isoforms) in both
> tables: RefSeq and knownGenesUCSC. I downloaded both tables for hg18.
> Both contain information about isoforms. I need an information of how
> many unique genes are in each table.
>
> RefSeq table has 34 074 entries. If I extract unique ID, I get 21 497
> entries.
> knownGenes has 66 804 entries. If I extract unique protein_ID, I get
> 36 857 entries.
>
> I am curious, is it a real difference in a number of genes in both table
> (~15 000) or I miss something?
>
> Thank you for your help,
> Maria
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
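
For anyone repeating the count suggested in the reply (unique
refGene.name2 versus unique knownCanonical.clusterId), here is a minimal
Python sketch. It assumes the tables were downloaded as tab-separated
text. The helper name count_unique, the sample rows, and the column
positions are all made up for illustration; real dumps have more
columns, so look up the 0-based position of name2 and clusterId in each
table's schema before running it on real files.

```python
import csv
import io


def count_unique(table_file, column):
    """Count distinct non-empty values in one 0-based column of a
    tab-separated table dump (hypothetical helper, not a UCSC tool)."""
    # QUOTE_NONE: treat quote characters in the data as ordinary text.
    reader = csv.reader(table_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    values = set()
    for row in reader:
        # Skip blank lines and "#"-prefixed header lines.
        if not row or row[0].startswith("#"):
            continue
        if column < len(row) and row[column]:
            values.add(row[column])
    return len(values)


# Made-up, simplified rows: three transcripts belonging to two genes.
REFGENE_SAMPLE = (
    "#name\tchrom\tname2\n"
    "tx_a1\tchr1\tGENE_A\n"
    "tx_a2\tchr1\tGENE_A\n"
    "tx_b1\tchr2\tGENE_B\n"
)

# Made-up, simplified rows: one canonical transcript per cluster.
KNOWNCANONICAL_SAMPLE = (
    "#chrom\tclusterId\ttranscript\n"
    "chr1\t1\ttx1\n"
    "chr1\t2\ttx2\n"
    "chr2\t3\ttx3\n"
)

# Column indexes match the simplified samples above, not the real schemas.
refseq_genes = count_unique(io.StringIO(REFGENE_SAMPLE), column=2)
ucsc_genes = count_unique(io.StringIO(KNOWNCANONICAL_SAMPLE), column=1)
print(refseq_genes, ucsc_genes)
```

With real downloads, pass an opened file instead of io.StringIO and set
column to the position given in the table schema. Empty fields are
skipped, which is the same reason counting on knownGene.protein_ID
undercounts: records with no value in that field drop out.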
