Re: [Genome] UCSC protein fasta database

Evan Bai Tue, 24 Jul 2012 10:22:20 -0700

Hi, Brooke,

Thanks again for the quick reply.
The knownGenePep data set is extremely useful.  However, I have found some 
slight discrepancies between the peptide sequences stored in this file and 
what is shown in the genome browser itself.  For example, at hg19 genomic 
position chr1:1,269,549-1,269,558, the browser window shows the peptide 
sequence for uc010nyk.2 as PGCY.  However, in the knownGenePep data table, the 
recorded sequence is:
"MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"

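For anyone who wants to reproduce this kind of comparison, here is a minimal
Python sketch.  It assumes the knownGenePep.txt.gz dump is tab-separated with
two columns (identifier, then sequence).  The function names read_pep_table,
find_all, and to_fasta are made up for illustration, and the toy record in the
comments is not real data.

```python
def read_pep_table(lines):
    """Parse 'name<TAB>seq' lines (assumed layout of the dump) into a dict."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, seq = line.split("\t", 1)
        table[name] = seq
    return table


def find_all(seq, peptide):
    """Return the 0-based start offset of every occurrence of peptide in seq."""
    hits = []
    start = seq.find(peptide)
    while start != -1:
        hits.append(start)
        start = seq.find(peptide, start + 1)
    return hits


def to_fasta(name, seq, width=60):
    """Format one record as FASTA, wrapping the sequence at `width` residues."""
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(chunks) + "\n"


# Toy usage with a made-up record (hypothetical identifier and sequence):
#   table = read_pep_table(["id1\tMKPRCYLL\n"])
#   find_all(table["id1"], "PRCY")
#   to_fasta("id1", table["id1"])
```

With the real file, the lines could come from gzip.open("knownGenePep.txt.gz",
"rt"), and find_all(table["uc010nyk.2"], "PGCY") would show whether the
peptide displayed in the browser occurs anywhere in the stored sequence.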

I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554.  When 
the nucleotide is C, the resulting amino acid would be R; whereas if the 
nucleotide is T, the amino acid would be C.  I know that these are small 
differences, but they do introduce false positives in my analyses.  I suspect 
that the browser version is the more up-to-date one, and I would really 
appreciate it if there is any way to download that version of the 
knownGenePep file.
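
The R-versus-C difference described above can be illustrated with a short
Python sketch.  The thread does not give the actual codon or strand at this
position, so the codon CGC used here is only an assumed example; the mapping
itself (CGC and CGT encode R, TGC and TGT encode C) comes from the standard
genetic code.  The names CODONS, substitute, and translate are hypothetical.

```python
# Only the four codons needed for this illustration (standard genetic code).
CODONS = {"CGC": "R", "CGT": "R", "TGC": "C", "TGT": "C"}


def substitute(codon, pos, base):
    """Return codon with the base at 0-based position pos replaced by base."""
    return codon[:pos] + base + codon[pos + 1:]


def translate(codon):
    """Look up the one-letter amino acid for a codon in the small table above."""
    return CODONS[codon]


reference = "CGC"                        # assumed example codon, C allele
variant = substitute(reference, 0, "T")  # T allele at the first codon position
print(translate(reference), translate(variant))
```

A single C-to-T change in the first position of such a codon is enough to
switch the residue, which is consistent with one allele yielding R and the
other yielding C.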

Thank you so much!
Best,
Evan


On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:

> Hi Evan,
> 
> The protein sequence for UCSC Genes is actually kept in the table 
> knownGenePep.  (This is an odd case; we generally do not store sequence in 
> tables for the Genome Browser.)  You can download the table from our 
> downloads server:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
> 
> Be aware that there is a second table, knownGeneTxPep, that contains a 
> slightly different sequence for some of the peptides.  There is a description 
> of the difference on the hg19 UCSC Genes description page 
> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
> 
> *knownGenePep* contains the protein sequences derived from the knownGeneMrna 
> transcript sequences. Any protein-level annotations, such as the contents of 
> the knownToPfam table, are based on these sequences.
> 
> *knownGeneTxPep* contains the protein translation (if any) of each mRNA 
> sequence in knownGeneTxMrna.
> 
> I see another question from you with the same subject line from last week; I 
> think this response answers both questions.  If not, or if you have further 
> questions, please write back to us at [email protected].
> 
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
> 
> 
> On 7/23/12 10:06 AM, Evan Bai wrote:
>> Hi,
>> 
>> I have a question regarding retrieving protein fasta sequences from
>> the genome browser.
>> 
>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>> this result:
>> >uc010nyk.2 (TAS1R3) length=852
>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>> NDGNTGNQGKHE
>> A fasta file with the protein sequences for uc010nyk.2
>> 
>> And I wonder how I can download the protein fasta file for all hg19
>> proteins with UCSC identifier.
>> 
>> thank you!
>> 
>> Sincerely,
>> Evan Bai
>> Yale University
>> _______________________________________________
>> Genome maillist  -  [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>> 

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
