Hello Maria,

UCSC Genes (as currently constructed) is expected to contain more genes 
than RefSeq Genes.

The UCSC Genes track includes all of RefSeq Genes (as of the time of the 
track build) plus CCDS and UniProt (with confirming information from 
GenBank). See the UCSC Genes track description for full details on 
content and methods.

I am not clear on exactly how you are counting genes, but to be 
consistent with the stated track methods, count each set this way:

unique refGene.name2 = number of genes in RefSeq Genes
unique knownCanonical.clusterId = number of genes in UCSC Genes
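
As a rough illustration of the two counts above, here is a minimal 
Python sketch that counts distinct values in one column of a 
tab-separated table dump. The sample rows and identifiers are made up 
and reduced to two columns each; in a real download the position of 
name2 or clusterId depends on the table's full column layout, so the 
column indexes used here are assumptions to check against the table 
schema.

```python
import csv
import io


def count_unique(table_text, column):
    """Count distinct non-empty values in one column of a tab-separated dump."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    values = set()
    for row in reader:
        # Skip blank lines and "#"-prefixed header lines.
        if not row or row[0].startswith("#"):
            continue
        if row[column]:
            values.add(row[column])
    return len(values)


# Hypothetical, simplified refGene rows: (name, name2).
# Two transcripts share GENEA, so they count as one gene.
REFGENE_SAMPLE = "#name\tname2\ntxA\tGENEA\ntxB\tGENEA\ntxC\tGENEB\n"

# Hypothetical, simplified knownCanonical rows: (clusterId, transcript).
KNOWNCANONICAL_SAMPLE = "#clusterId\ttranscript\n1\ttxA\n2\ttxC\n"

refseq_gene_count = count_unique(REFGENE_SAMPLE, 1)
ucsc_gene_count = count_unique(KNOWNCANONICAL_SAMPLE, 0)
print(refseq_gene_count, ucsc_gene_count)
```

For a real file, read its contents into a string (or adapt the function 
to take an open file) and pass the index of the name2 or clusterId 
column.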

Using knownGene.proteinID to count genes can be problematic, because 
some records have no data in that field. Link in information from kgXref 
and/or kgAlias to get various gene handles for each transcript, 
including those with no knownGene.proteinID data. Please note that UCSC 
Genes is not clustered by gene name but by alignment information 
(described in the track description), so you may find mixed gene 
name/symbol data within a "cluster". Which one is correct is for you to 
determine; it is a judgment call best made by a person. How you prefer 
to cluster will likely depend on what you are doing with the data 
downstream.
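
One way to spot clusters with mixed gene symbols is sketched below in 
Python. It assumes you have already loaded two mappings: transcript to 
cluster ID (for example from the knownIsoforms table, which is an 
assumption here and not mentioned above) and transcript to gene symbol 
(for example from kgXref.geneSymbol). All identifiers and data in the 
example are hypothetical.

```python
def symbols_per_cluster(transcript_to_cluster, transcript_to_symbol):
    """Group the gene symbols of each cluster's transcripts by cluster ID."""
    clusters = {}
    for transcript, cluster_id in transcript_to_cluster.items():
        # Create the cluster entry even if no transcript has a symbol.
        symbols = clusters.setdefault(cluster_id, set())
        symbol = transcript_to_symbol.get(transcript)
        if symbol:
            symbols.add(symbol)
    return clusters


def mixed_clusters(clusters):
    """Return only the clusters that carry more than one gene symbol."""
    return {cid: sorted(syms) for cid, syms in clusters.items()
            if len(syms) > 1}


# Hypothetical data: cluster 1 has two symbols, cluster 3 has none.
TRANSCRIPT_TO_CLUSTER = {"txA": 1, "txB": 1, "txC": 2, "txD": 3}
TRANSCRIPT_TO_SYMBOL = {"txA": "GENEA", "txB": "GENEA2", "txC": "GENEB"}

clusters = symbols_per_cluster(TRANSCRIPT_TO_CLUSTER, TRANSCRIPT_TO_SYMBOL)
to_review = mixed_clusters(clusters)
print(len(clusters), to_review)
```

The clusters returned by mixed_clusters are the ones that need a manual 
decision about which name to keep.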

Best of luck with your project,
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/6/10 11:41 AM, Maria Poptsova wrote:
> Hi Jennifer,
>
> I have one question about the number of unique genes (no isoforms) in
> two tables: RefSeq and knownGenesUCSC. I downloaded both tables for
> hg18. Both contain information about isoforms. I need to know how many
> unique genes are in each table.
>
> The RefSeq table has 34 074 entries. If I extract unique IDs, I get
> 21 497 entries.
> knownGenes has 66 804 entries. If I extract unique protein_IDs, I get
> 36 857 entries.
>
> I am curious: is there a real difference in the number of genes
> between the two tables (~15 000), or am I missing something?
>
> Thank you for your help,
> Maria
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
