Hi Evan,

The protein translation you see in the main Genome Browser window is based on the sequence of the reference genome. The knownGenePep sequence is a translation of the transcript in the knownGeneMrna table. If you would rather have the translation that matches the reference genome, you can use knownGeneTxPep. This is more obvious when you read the description of all of these tables (I should have included this before!):

*knownGeneMrna* contains the mRNA sequence that represents each UCSC Genes transcript. If the transcript is based on a RefSeq transcript, then this table contains the RefSeq transcript, including any portions that do not align to the genome.

*knownGeneTxMrna* contains mRNA sequences for each UCSC Genes transcript. In contrast to the sequences in knownGeneMrna, these sequences are derived by obtaining the sequences for each exon from the reference genome and concatenating these exonic sequences.

*knownGenePep* contains the protein sequences derived from the knownGeneMrna transcript sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based on these sequences.

*knownGeneTxPep* contains the protein translation (if any) of each mRNA sequence in knownGeneTxMrna.

I hope that makes more sense.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 7/24/12 10:04 AM, Evan Bai wrote:
> Hi, Brooke,
>
> Thanks again for the quick reply.
> The knownGenePep data set is extremely useful. However, I have found
> some slight discrepancies between the peptide sequence stored in this
> file and what is shown in the actual genome browser. For example, at
> hg19 genomic position *chr1:1,269,549-1,269,558*, in the browser
> window, the peptide sequence for *uc010nyk.2* is *PGCY*.
> However, in the actual knownGenePep data table, the recorded sequence is:
> "MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQ*PGRY*NRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"
>
> I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554.
> When the nucleotide is C, the resulting AA would be R; whereas if the
> nucleotide is T, the AA would be C. I know that these are very trivial
> differences. But they do introduce false positives in my analyses. I
> suspect that the browser version is the more updated one, and would
> really appreciate it if there is any way to download that version of the
> knownGenePep file.
>
> Thank you so much!
> Best,
> Evan
>
>
> On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:
>
>> Hi Evan,
>>
>> The protein sequence for UCSC Genes is actually kept in the table
>> knownGenePep. (This is an odd case; we generally do not store
>> sequence in tables for the Genome Browser.)
>> You can download the table from our downloads server:
>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
>> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
>>
>> Be aware that there is a second table, knownGeneTxPep, that contains a
>> slightly different sequence for some of the peptides. There is a
>> description of the difference on the hg19 UCSC Genes description page
>> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
>>
>> *knownGenePep* contains the protein sequences derived from the
>> knownGeneMrna transcript sequences. Any protein-level annotations,
>> such as the contents of the knownToPfam table, are based on these
>> sequences.
>>
>> *knownGeneTxPep* contains the protein translation (if any) of each
>> mRNA sequence in knownGeneTxMrna.
>>
>> I see another question from you with the same subject line from last
>> week; I think this response answers both questions. If not, or if you
>> have further questions, please write back to us at [email protected].
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> On 7/23/12 10:06 AM, Evan Bai wrote:
>>> Hi,
>>>
>>> I have a question regarding retrieving protein fasta sequences from
>>> the genome browser.
>>>
>>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>>> this result:
>>> >uc010nyk.2 (TAS1R3) length=852
>>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>>> NDGNTGNQGKHE
>>> A fasta file with the protein sequences for uc010nyk.2
>>>
>>> And I wonder how I can download the protein fasta file for all hg19
>>> proteins with UCSC identifiers.
>>>
>>> thank you!
>>>
>>> Sincerely, Evan Bai Yale University
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
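
For readers who want to act on the table descriptions in this thread, here is a minimal Python sketch (not an official UCSC tool). It turns a downloaded peptide table into FASTA and lists the residues at which the mRNA-based and genome-based translations of a transcript disagree. It assumes each table dump is a gzipped, tab-separated file with two columns, transcript id then peptide sequence, and that knownGeneTxPep can be downloaded in the same layout as knownGenePep. The function names and the "ucTOY.1" rows are made up for illustration; the toy peptides only mimic the PGRY/PGCY difference discussed above.

```python
import gzip
import io


def read_pep_table(handle):
    """Parse tab-separated lines of (transcript id, peptide) into a dict."""
    peptides = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, seq = line.split("\t", 1)
        peptides[name] = seq
    return peptides


def load_gz(path):
    """Read a gzipped dump such as knownGenePep.txt.gz (two-column layout assumed)."""
    with gzip.open(path, "rt") as handle:
        return read_pep_table(handle)


def to_fasta(peptides, width=60):
    """Render a {name: sequence} dict as FASTA text, wrapped at `width` residues."""
    records = []
    for name, seq in peptides.items():
        chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(chunks) + "\n")
    return "".join(records)


def diff_positions(pep_a, pep_b):
    """List (1-based position, residue_a, residue_b) where two peptides differ.

    Only positions present in both sequences are compared.
    """
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(pep_a, pep_b))
            if a != b]


if __name__ == "__main__":
    # Hypothetical toy rows, NOT real table contents.
    mrna_based = read_pep_table(io.StringIO("ucTOY.1\tSQPGRYNR\n"))
    genome_based = read_pep_table(io.StringIO("ucTOY.1\tSQPGCYNR\n"))

    print(to_fasta(genome_based), end="")
    # Compare only transcripts present in both tables.
    for name in sorted(mrna_based.keys() & genome_based.keys()):
        print(name, diff_positions(mrna_based[name], genome_based[name]))
```

With real data, one would call load_gz() on each downloaded file in place of the toy rows. Transcripts for which diff_positions() returns an empty list have identical translations over their shared length in both tables.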
