Hi Mathias,

The paper you're citing describes a previous version of the UCSC Genes build methods. Please read the current methods on the UCSC Genes description page: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=knownGene

If that doesn't clear up why you're seeing multiple gene IDs with the same protein ID, please let us know: [email protected]

Fan Hsu pointed out, though, that you may want to consider using the current knownGenePep table, which contains predicted amino acid sequences derived directly from the DNA sequences of UCSC Genes based on the reference genome.

- Greg Roe
UCSC Genome Bioinformatics Group

On 8/8/11 12:56 AM, Kuhring, Mathias wrote:
> Hi everybody,
>
> I downloaded the knownGene table with the table browser (default settings) to
> use the annotations for some protein mapping.
>
> Hsu's paper (The UCSC Known Genes, 2006) says "mRNA with the highest score is
> selected as the representative mRNA for the protein" and "removing duplicates
> having identical chromosome number, start and ending positions of coding
> sequence"
>
> So actually I expected one gene per protein (UniProt ID), but I found a couple
> of genes coding for the same protein.
> This is causing some trouble, because now I'm not sure which one to take for
> my protein.
>
> The "redundant" (?) genes share almost the same loci and/or CDS, but seem to
> differ in number of exons and/or splice sites.
> So I'm afraid they don't code for the same amino acid sequence, which I
> thought usually leads to different proteins (or are there many exceptions?).
>
> Here is an example (I attached some more):
> #name chrom strand txStart txEnd cdsStart cdsEnd exonCount
> exonStarts exonEnds proteinID alignID
> uc001lqz.2 chr11 + 747431 765023 747481 764845 8
> 747431,755878,758949,760121,763343,763746,764287,764812,
> 747578,756002,759057,760253,763519,763944,764433,765023, P37837
> uc001lqz.2
> uc001lra.2 chr11 + 747431 765023 747481 764413 8
> 747431,755878,758949,760121,763343,763746,764287,764812,
> 747578,756002,759057,760253,763519,763940,764433,765023, P37837
> uc001lra.2
>
> The second gene's cdsEnd is smaller and exon 6 ends 4 positions earlier but
> is still in the CDS.
> I had a look at the sequences and there is a shift. So I think the
> probability that they code for the same protein is pretty low.
>
> But how should I handle those genes now? Do they code for protein isoforms
> which didn't get a unique protein ID yet?
> Maybe I got Hsu's paper wrong? Or did I just miss some information somewhere?
>
> I hope you can help me with this. Thanks a lot.
>
> Greetinz,
> Mathias
>
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
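
For readers who want to check cases like the one quoted above, here is a minimal sketch in Python (not part of the UCSC build methods, just an illustration). It groups knownGene rows by proteinID and compares the coding length of each transcript by clipping exons to the cdsStart..cdsEnd range. The two rows are copied from the message; the helper names (parse_row, cds_blocks, cds_length, group_by_protein) are hypothetical, and the parser assumes whitespace-separated fields with no empty columns, which may not hold for every row of a real table-browser download.

```python
from collections import defaultdict, namedtuple

# Column names as given in the header line of the quoted example.
Transcript = namedtuple(
    "Transcript",
    "name chrom strand txStart txEnd cdsStart cdsEnd exonCount "
    "exonStarts exonEnds proteinID alignID",
)

# The two example rows from the message, one record per string.
ROWS = [
    "uc001lqz.2 chr11 + 747431 765023 747481 764845 8 "
    "747431,755878,758949,760121,763343,763746,764287,764812, "
    "747578,756002,759057,760253,763519,763944,764433,765023, "
    "P37837 uc001lqz.2",
    "uc001lra.2 chr11 + 747431 765023 747481 764413 8 "
    "747431,755878,758949,760121,763343,763746,764287,764812, "
    "747578,756002,759057,760253,763519,763940,764433,765023, "
    "P37837 uc001lra.2",
]


def parse_row(line):
    """Parse one row; assumes 12 whitespace-separated, non-empty fields."""
    f = line.split()
    return Transcript(
        f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        [int(x) for x in f[8].split(",") if x],  # trailing comma is dropped
        [int(x) for x in f[9].split(",") if x],
        f[10], f[11],
    )


def cds_blocks(tx):
    """Return the exon pieces that fall inside the coding region."""
    blocks = []
    for start, end in zip(tx.exonStarts, tx.exonEnds):
        lo, hi = max(start, tx.cdsStart), min(end, tx.cdsEnd)
        if lo < hi:  # skip exons lying entirely in the UTRs
            blocks.append((lo, hi))
    return blocks


def cds_length(tx):
    """Coding length in bases (starts are 0-based, ends are exclusive)."""
    return sum(hi - lo for lo, hi in cds_blocks(tx))


def group_by_protein(transcripts):
    """Map each proteinID to the transcript names that carry it."""
    groups = defaultdict(list)
    for tx in transcripts:
        groups[tx.proteinID].append(tx.name)
    return dict(groups)


TRANSCRIPTS = [parse_row(r) for r in ROWS]

for protein, names in group_by_protein(TRANSCRIPTS).items():
    print(protein, names)
for tx in TRANSCRIPTS:
    print(tx.name, cds_length(tx), len(cds_blocks(tx)))
```

For the two rows above this gives coding lengths of 1014 and 957 bases, so the transcripts cannot encode the same amino acid sequence even though both carry proteinID P37837. Comparing the actual sequences in knownGenePep, as suggested in the reply, is the more direct check.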
