Re: [Genome] UCSC protein fasta database

Brooke Rhead Tue, 24 Jul 2012 12:05:09 -0700

Hi Evan,

The protein translation you see in the main Genome Browser window is
based on the sequence of the reference genome.  The knownGenePep
sequence is a translation of the transcript in the knownGeneMrna table,
which can differ from the reference genome (for example, at a SNP).
If you would rather have the translation that matches the reference
genome, you can use knownGeneTxPep.  This is more obvious when you read
the description of all of these tables (I should have included this
before!):


*knownGeneMrna* contains the mRNA sequence that represents each UCSC 
Genes transcript. If the transcript is based on a RefSeq transcript, 
then this table contains the RefSeq transcript, including any portions 
that do not align to the genome.

*knownGeneTxMrna* contains mRNA sequences for each UCSC Genes 
transcript. In contrast to the sequences in knownGeneMrna, these
sequences are derived by obtaining the sequences for each exon from the 
reference genome and concatenating these exonic sequences.

*knownGenePep* contains the protein sequences derived from the 
knownGeneMrna transcript sequences. Any protein-level annotations, such 
as the contents of the knownToPfam table, are based on these sequences.

*knownGeneTxPep* contains the protein translation (if any) of each mRNA 
sequence in knownGeneTxMrna.
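
To see which transcripts are affected by the difference described above, one option is to compare the two peptide tables directly. The following is only a rough Python sketch, not an official UCSC tool. It assumes that both table dumps are plain tab-separated text with two columns (transcript name, then sequence) and that you have downloaded a knownGeneTxPep.txt.gz file alongside knownGenePep.txt.gz. The function names (parse_pep_lines, load_pep_table, differing_ids, to_fasta) are made up for this illustration.

```python
import gzip


def parse_pep_lines(lines):
    """Parse tab-separated "name<TAB>sequence" lines into a dict.

    Assumption: each row holds exactly a transcript ID and its peptide.
    Blank lines and lines without a tab are skipped.
    """
    peptides = {}
    for line in lines:
        name, sep, seq = line.rstrip("\n").partition("\t")
        if not sep:
            continue
        peptides[name] = seq
    return peptides


def load_pep_table(path):
    """Read a gzipped or plain-text table dump from disk."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_pep_lines(handle)


def differing_ids(pep_a, pep_b):
    """Return sorted IDs found in both tables whose sequences differ."""
    shared = pep_a.keys() & pep_b.keys()
    return sorted(name for name in shared if pep_a[name] != pep_b[name])


def to_fasta(peptides, width=60):
    """Format a dict of sequences as FASTA text, wrapped at `width` columns."""
    records = []
    for name in sorted(peptides):
        seq = peptides[name]
        chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(chunks) + "\n")
    return "".join(records)


# Example use (file names are assumptions about what you downloaded):
#   mrna_pep = load_pep_table("knownGenePep.txt.gz")
#   genome_pep = load_pep_table("knownGeneTxPep.txt.gz")
#   print(differing_ids(mrna_pep, genome_pep))
#   with open("knownGeneTxPep.fa", "w") as out:
#       out.write(to_fasta(genome_pep))
```

Parsing is kept separate from file handling so the comparison can be tried on a few lines of text before running it on the full tables. If the dumps turn out to have a different column layout, only parse_pep_lines would need to change.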

I hope that makes more sense.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



On 7/24/12 10:04 AM, Evan Bai wrote:
> Hi, Brooke,
>
> Thanks again for the quick reply.
> The knownGenePep data set is extremely useful.  However, I have found
> some slight discrepancies between the peptide sequence stored in this
> file and what is shown in the actual Genome Browser.  For example, at
> hg19 genomic position *chr1:1,269,549-1,269,558*, the browser window
> shows the peptide sequence for *uc010nyk.2* as *PGCY*.  However, in
> the actual knownGenePep data table, the recorded sequence is:
> "MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQ*PGRY*NRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"
>
> I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554.
> When the nucleotide is C, the resulting AA would be R; whereas if the
> nucleotide is T, the AA would be C. I know that these are very trivial
> differences, but they do introduce false positives in my analyses.  I
> suspect that the browser version is the more up-to-date one, and would
> really appreciate it if there is any way to download that version of the
> knownGenePep file.
>
> Thank you so much!
> Best,
> Evan
>
>
> On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:
>
>> Hi Evan,
>>
>> The protein sequence for UCSC Genes is actually kept in the table
>> knownGenePep.  (This is an odd case; we generally do not store
>> sequence in tables for the Genome Browser.)  You can download the
>> table from our downloads server:
>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
>> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
>>
>> Be aware that there is a second table, knownGeneTxPep, that contains a
>> slightly different sequence for some of the peptides.  There is a
>> description of the difference on the hg19 UCSC Genes description page
>> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
>>
>> *knownGenePep* contains the protein sequences derived from the
>> knownGeneMrna transcript sequences. Any protein-level annotations,
>> such as the contents of the knownToPfam table, are based on these
>> sequences.
>>
>> *knownGeneTxPep* contains the protein translation (if any) of each
>> mRNA sequence in knownGeneTxMrna.
>>
>> I see another question from you with the same subject line from last
>> week; I think this response answers both questions.  If not, or if you
>> have further questions, please write back to us at [email protected].
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> On 7/23/12 10:06 AM, Evan Bai wrote:
>>> Hi,
>>>
>>> I have a question regarding retrieving protein fasta sequences from
>>> the genome browser.
>>>
>>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>>> this result:
>>> >uc010nyk.2 (TAS1R3) length=852
>>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>>> NDGNTGNQGKHE
>>>
>>> A fasta file with the protein sequences for uc010nyk.2
>>>
>>> And I wonder how I can download the protein fasta file for all hg19
>>> proteins with UCSC identifiers.
>>>
>>> thank you!
>>>
>>> Sincerely,
>>> Evan Bai
>>> Yale University
>>> _______________________________________________
>>> Genome maillist  -  [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>
>
