Hi Brooke,

Thanks again for the quick reply. The knownGenePep data set is extremely useful. However, I have found some slight discrepancies between the peptide sequence stored in this file and the one shown in the genome browser. For example, at hg19 genomic position chr1:1,269,549-1,269,558, the browser window shows the peptide sequence for uc010nyk.2 as PGCY. However, in the knownGenePep data table, the recorded sequence is:

"MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"

I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554. When the nucleotide is C, the resulting amino acid is R, whereas when the nucleotide is T, the amino acid is C. I know that these are very minor differences, but they do introduce false positives in my analyses. I suspect that the browser version is the more up-to-date one, and I would really appreciate it if there is a way to download that version of the knownGenePep file.

Thank you so much!

Best,
Evan

On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:

> Hi Evan,
>
> The protein sequence for UCSC Genes is actually kept in the table
> knownGenePep. (This is an odd case; we generally do not store sequence in
> tables for the Genome Browser.) You can download the table from our
> downloads server:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
>
> Be aware that there is a second table, knownGeneTxPep, that contains a
> slightly different sequence for some of the peptides. There is a description
> of the difference on the hg19 UCSC Genes description page
> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
>
> *knownGenePep* contains the protein sequences derived from the knownGeneMrna
> transcript sequences. Any protein-level annotations, such as the contents of
> the knownToPfam table, are based on these sequences.
>
> *knownGeneTxPep* contains the protein translation (if any) of each mRNA
> sequence in knownGeneTxMrna.
>
> I see another question from you with the same subject line from last week; I
> think this response answers both questions. If not, or if you have further
> questions, please write back to us at [email protected].
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
> On 7/23/12 10:06 AM, Evan Bai wrote:
>> Hi,
>>
>> I have a question regarding retrieving protein fasta sequences from
>> the genome browser.
>>
>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>> this result:
>>
>> >uc010nyk.2 (TAS1R3) length=852
>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>> NDGNTGNQGKHE
>>
>> A fasta file with the protein sequences for uc010nyk.2
>>
>> And I wonder how I can download the protein fasta file for all hg19
>> proteins with UCSC identifier.
>>
>> Thank you!
>>
>> Sincerely,
>> Evan Bai
>> Yale University
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
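
For the original question in this thread (getting a protein FASTA file for all hg19 UCSC identifiers), one possible approach is to convert the downloaded knownGenePep.txt.gz table into FASTA. The sketch below is only an illustration: it assumes the file is tab-separated with the UCSC identifier in the first column and the peptide sequence in the second, and the function names and the output file name are hypothetical.

```python
import gzip


def table_to_fasta(lines, width=60):
    """Yield FASTA lines from rows of the form 'name<TAB>sequence'.

    Assumption: each row holds a UCSC identifier and its peptide
    sequence separated by a single tab (not confirmed in the thread).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank rows
        name, seq = line.split("\t", 1)
        seq = seq.strip()
        yield f">{name}"
        # Wrap the sequence at a fixed width, as in the browser output.
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


def write_fasta(table_gz_path, fasta_path, width=60):
    """Read a gzipped table dump and write it out as a FASTA file."""
    with gzip.open(table_gz_path, "rt") as src, open(fasta_path, "w") as out:
        for fasta_line in table_to_fasta(src, width):
            out.write(fasta_line + "\n")


# Example call (hypothetical output file name):
# write_fasta("knownGenePep.txt.gz", "knownGenePep.fa")
```

Note that this only reformats the table contents; it would not reconcile the table with what the browser displays at a SNP position.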
