Hi Brooke,

Sorry about my confusion, and thanks again for your clarification.

Best,
Evan

On Jul 24, 2012, at 3:04 PM, Brooke Rhead wrote:

> Hi Evan,
>
> The protein translation you see in the main Genome Browser window is based on
> the sequence of the reference genome. The knownGenePep sequence is a
> translation of the transcript in the knownGeneMrna table. If you would
> rather have the translation that matches the reference genome, you can use
> knownGeneTxPep. This is more obvious when you read the description of all of
> these tables (I should have included this before!):
>
> *knownGeneMrna* contains the mRNA sequence that represents each UCSC Genes
> transcript. If the transcript is based on a RefSeq transcript, then this
> table contains the RefSeq transcript, including any portions that do not
> align to the genome.
>
> *knownGeneTxMrna* contains mRNA sequences for each UCSC Genes transcript. In
> contrast to the sequences in knownGeneMrna, these sequences are derived by
> obtaining the sequences for each exon from the reference genome and
> concatenating these exonic sequences.
>
> *knownGenePep* contains the protein sequences derived from the knownGeneMrna
> transcript sequences. Any protein-level annotations, such as the contents of
> the knownToPfam table, are based on these sequences.
>
> *knownGeneTxPep* contains the protein translation (if any) of each mRNA
> sequence in knownGeneTxMrna.
>
> I hope that makes more sense.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
> On 7/24/12 10:04 AM, Evan Bai wrote:
>> Hi, Brooke,
>>
>> Thanks again for the quick reply.
>> The knownGenePep data set is extremely useful. However, I have found
>> some slight discrepancies between the peptide sequence stored in this
>> file versus what is shown in the actual genome browser. For example, as
>> for hg19 genomic position *chr1:1,269,549-1,269,558*, in the browser
>> window, the peptide sequence for *uc010nyk.2* is *PGCY*.
>> However, in the actual knownGenePep data table, the recorded sequence is:
>> "MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQ*PGRY*NRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"
>>
>> I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554.
>> When the nucleotide is C, the resulting AA would be R; whereas if the
>> nucleotide is T, the AA would be C. I know that these are very trivial
>> differences. But they do introduce false positives in my analyses. I
>> suspect that the browser version is the more updated one, and would
>> really appreciate it if there is any way to download that version of the
>> knownGenePep file.
>>
>> Thank you so much!
>> Best,
>> Evan
>>
>> On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:
>>
>>> Hi Evan,
>>>
>>> The protein sequence for UCSC Genes is actually kept in the table
>>> knownGenePep. (This is an odd case; we generally do not store
>>> sequence in tables for the Genome Browser.)
>>> You can download the table from our downloads server:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
>>> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
>>>
>>> Be aware that there is a second table, knownGeneTxPep, that contains a
>>> slightly different sequence for some of the peptides. There is a
>>> description of the difference on the hg19 UCSC Genes description page
>>> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
>>>
>>> *knownGenePep* contains the protein sequences derived from the
>>> knownGeneMrna transcript sequences. Any protein-level annotations,
>>> such as the contents of the knownToPfam table, are based on these
>>> sequences.
>>>
>>> *knownGeneTxPep* contains the protein translation (if any) of each
>>> mRNA sequence in knownGeneTxMrna.
>>>
>>> I see another question from you with the same subject line from last
>>> week; I think this response answers both questions. If not, or if you
>>> have further questions, please write back to us at [email protected].
>>>
>>> --
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>>
>>> On 7/23/12 10:06 AM, Evan Bai wrote:
>>>> Hi,
>>>>
>>>> I have a question regarding retrieving protein fasta sequences from
>>>> the genome browser.
>>>>
>>>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>>>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>>>> this result:
>>>> >uc010nyk.2 (TAS1R3) length=852
>>>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>>>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>>>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>>>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>>>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>>>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>>>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>>>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>>>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>>>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>>>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>>>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>>>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>>>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>>>> NDGNTGNQGKHE
>>>>
>>>> A fasta file with the protein sequences for uc010nyk.2
>>>>
>>>> And I wonder how I can download the protein fasta file for all hg19
>>>> proteins with UCSC identifiers.
>>>>
>>>> Thank you!
>>>>
>>>> Sincerely,
>>>> Evan Bai
>>>> Yale University
>>>> _______________________________________________
>>>> Genome maillist - [email protected]
>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
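
For readers who want the FASTA file Evan asked about, here is a minimal Python sketch that converts a downloaded table dump such as knownGenePep.txt.gz into FASTA. It assumes each line of the dump holds a transcript name and a peptide sequence separated by a tab; that layout is an assumption, so check the table schema before relying on it. The function names and the output file name are made up for illustration.

```python
import gzip


def table_to_fasta(lines, width=60):
    """Yield one FASTA record per 'name<TAB>sequence' line.

    Assumes a two-column, tab-separated layout (an assumption about the dump).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, seq = line.split("\t", 1)
        # Wrap the sequence at `width` residues per line, as in the browser output.
        chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
        yield ">" + name + "\n" + "\n".join(chunks) + "\n"


def convert_file(table_path, fasta_path, width=60):
    """Read a gzipped table dump and write a FASTA file (hypothetical helper)."""
    with gzip.open(table_path, "rt") as src, open(fasta_path, "w") as dst:
        for record in table_to_fasta(src, width):
            dst.write(record)


if __name__ == "__main__":
    # Tiny in-memory example; the identifier and sequence are placeholders.
    for record in table_to_fasta(["ucExample.1\tMLGPAVLG"], width=5):
        print(record, end="")
```

Calling convert_file("knownGenePep.txt.gz", "knownGenePep.fa") would then write one record per transcript. The same call should work for knownGeneTxPep if that dump follows the same two-column layout.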

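
The one-residue difference discussed in this thread (PGRY in knownGenePep versus PGCY in the browser) can be located programmatically. The sketch below compares two equal-length peptide strings and lists the positions where they differ. The codon pair CGC/TGC only illustrates how a C/T change in the first codon base switches arginine to cysteine; the actual codon at rs307377 is not given in this thread, and the function name is hypothetical.

```python
# Two entries from the standard genetic code; the real codon at the SNP
# is not stated in the thread, so CGC/TGC are illustrative assumptions.
CODON_TO_AA = {"CGC": "R", "TGC": "C"}


def peptide_differences(pep_a, pep_b):
    """Return (1-based position, residue in pep_a, residue in pep_b) per mismatch."""
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides differ in length; align them first")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(pep_a, pep_b), start=1)
        if a != b
    ]


if __name__ == "__main__":
    # The four residues quoted in the thread: table copy vs. browser display.
    print(peptide_differences("PGRY", "PGCY"))
    # A C-to-T change in the first codon base swaps arginine for cysteine.
    print(CODON_TO_AA["CGC"], CODON_TO_AA["TGC"])
```

Running the same comparison over matching entries of knownGenePep and knownGeneTxPep would flag transcripts where the mRNA-based and genome-based translations disagree, which is one way to screen out the false positives Evan describes.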