Hi Greg, thank you very much for the answer and the hints.
But I'm afraid the current UCSC Genes description doesn't really help me understand this either. At least it doesn't say that only one mRNA alignment is kept per protein, so that might just be the reason.

As I had some trouble assigning my peptides to some isoforms, I decided to look at all proteins/genes with the same protein ID anyway (including the isoforms). Then I just take the first gene my peptide matches. As I'm only interested in the peptides' positions in the genome, this should be sufficient for me.

I'll have a look at the knownGenePep table. It might improve my mapping performance, but I think it won't help me with the protein-gene relationship problem caused by the identical protein IDs. As my peptides are already assigned to a protein with an ID, I use this ID to restrict the peptide's origin in the genome. So the AA sequences alone are not useful in my first steps. But as written above, I have solved this problem, at least well enough for me.

So thank you again for your help.

Greetings,
Mathias

-----Original Message-----
From: Greg Roe [mailto:[email protected]]
Sent: Saturday, 13 August 2011 00:53
To: Kuhring, Mathias
Cc: [email protected]
Subject: Re: [Genome] Different genes with same protein id

Hi Mathias,

The paper you're citing describes a previous version of the UCSC Genes build methods. Please read the current methods on the UCSC Genes description page:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=knownGene

If that doesn't clear up why you're seeing multiple gene IDs with the same protein ID, please let us know: [email protected]

Fan Hsu also pointed out that you may want to consider using the current knownGenePep table, which contains predicted AA sequences derived directly from the DNA sequences of UCSC Genes based on the reference genome.
- Greg Roe
UCSC Genome Bioinformatics Group

On 8/8/11 12:56 AM, Kuhring, Mathias wrote:

Hi everybody,

I downloaded the knownGene table with the Table Browser (default settings) to use the annotations for some protein mapping. Hsu's paper (The UCSC Known Genes, 2006) says "mRNA with the highest score is selected as the representative mRNA for the protein" and "removing duplicates having identical chromosome number, start and ending positions of coding sequence".

So I actually expected one gene per protein (UniProt ID), but I found a couple of genes coding for the same protein. This is causing some trouble, because now I'm not sure which one to take for my protein. The "redundant" (?) genes share almost the same loci and/or CDS, but seem to differ in the number of exons and/or splice sites. So I'm afraid they don't code for the same amino acid sequence, which I thought usually leads to different proteins (or are there many exceptions?).

Here is an example (I attached some more):

#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001lqz.2 chr11 + 747431 765023 747481 764845 8 747431,755878,758949,760121,763343,763746,764287,764812, 747578,756002,759057,760253,763519,763944,764433,765023, P37837 uc001lqz.2
uc001lra.2 chr11 + 747431 765023 747481 764413 8 747431,755878,758949,760121,763343,763746,764287,764812, 747578,756002,759057,760253,763519,763940,764433,765023, P37837 uc001lra.2

The second gene's cdsEnd is smaller and exon 6 ends 4 positions earlier but is still in the CDS. I had a look at the sequences and there is a shift. So I think the probability that they code for the same protein is pretty low.

But how should I handle those genes now? Do they code for protein isoforms which haven't got a unique protein ID yet? Did I maybe get Hsu's paper wrong? Or did I just miss some information somewhere?

I hope you can help me with this. Thanks a lot.
Greetinz,
Mathias

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
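
The check described in this thread (collect all knownGene rows that share a protein ID, then compare their coding regions) could be sketched roughly as follows. This is only a minimal Python illustration, not the method used by UCSC or by the original poster. It assumes the column order shown in the example header of the original question, whitespace-separated fields (the real Table Browser download is tab-separated), and UCSC-style 0-based, half-open coordinates. The helper names `parse_row`, `cds_length` and `group_by_protein` are hypothetical.

```python
from collections import defaultdict

# Column order as given in the example header of the original question (assumed).
FIELDS = ["name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
          "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID"]

# The two example rows quoted in the thread.
ROWS = """\
uc001lqz.2 chr11 + 747431 765023 747481 764845 8 747431,755878,758949,760121,763343,763746,764287,764812, 747578,756002,759057,760253,763519,763944,764433,765023, P37837 uc001lqz.2
uc001lra.2 chr11 + 747431 765023 747481 764413 8 747431,755878,758949,760121,763343,763746,764287,764812, 747578,756002,759057,760253,763519,763940,764433,765023, P37837 uc001lra.2
"""


def parse_row(line):
    """Turn one knownGene line into a dict with numeric coordinates."""
    rec = dict(zip(FIELDS, line.split()))
    for key in ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount"):
        rec[key] = int(rec[key])
    for key in ("exonStarts", "exonEnds"):
        # The lists end with a trailing comma, so skip the empty last item.
        rec[key] = [int(x) for x in rec[key].split(",") if x]
    return rec


def cds_length(rec):
    """Count coding bases: each exon clipped to the interval [cdsStart, cdsEnd)."""
    total = 0
    for start, end in zip(rec["exonStarts"], rec["exonEnds"]):
        total += max(0, min(end, rec["cdsEnd"]) - max(start, rec["cdsStart"]))
    return total


def group_by_protein(records):
    """Map each proteinID to the list of gene records that carry it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["proteinID"]].append(rec)
    return groups


records = [parse_row(line) for line in ROWS.strip().splitlines()]
for protein_id, recs in group_by_protein(records).items():
    for rec in recs:
        n = cds_length(rec)
        # A CDS length divisible by 3 is consistent with a complete reading frame.
        print(protein_id, rec["name"], "CDS bases:", n, "divisible by 3:", n % 3 == 0)
```

If two entries with the same protein ID report different CDS lengths, they cannot encode the same amino acid sequence, which matches the suspicion raised in the original question. Comparing lengths is only a quick first filter; comparing the translated sequences (for example from knownGenePep, as suggested in the reply) would be the more direct check.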
