Hi Brooke,

Sorry about my confusion.  And thanks again for your clarification.

Best,
Evan

On Jul 24, 2012, at 3:04 PM, Brooke Rhead wrote:

> Hi Evan,
> 
> The protein translation you see in the main Genome Browser window is based on 
> the sequence of the reference genome.  The knownGenePep sequence is a 
> translation of the transcript in the knownGeneMrna table.  If you would 
> rather have the translation that matches the reference genome, you can use 
> knownGeneTxPep.  This is more obvious when you read the description of all of 
> these tables (I should have included this before!):
> 
> *knownGeneMrna* contains the mRNA sequence that represents each UCSC Genes 
> transcript. If the transcript is based on a RefSeq transcript, then this 
> table contains the RefSeq transcript, including any portions that do not 
> align to the genome.
> 
> *knownGeneTxMrna* contains mRNA sequences for each UCSC Genes transcript. In 
> contrast to the sequences in knownGeneMrna, these sequences are derived by 
> obtaining the sequences for each exon from the reference genome and 
> concatenating these exonic sequences.
> 
> *knownGenePep* contains the protein sequences derived from the knownGeneMrna 
> transcript sequences. Any protein-level annotations, such as the contents of 
> the knownToPfam table, are based on these sequences.
> 
> *knownGeneTxPep* contains the protein translation (if any) of each mRNA 
> sequence in knownGeneTxMrna.
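Given the descriptions above, one way to see which transcripts are affected is to compare the two peptide tables directly. This is only a minimal sketch: it assumes each table has been downloaded and decompressed into lines of the form transcript ID, tab, peptide sequence (that two-column layout is an assumption; check the table schema), and the function names are made up for illustration.

```python
def parse_pep_rows(rows):
    """Map transcript ID -> peptide sequence from 'name<TAB>sequence' rows."""
    table = {}
    for row in rows:
        row = row.rstrip("\n")
        if row:  # skip blank lines
            name, seq = row.split("\t", 1)
            table[name] = seq
    return table


def residue_differences(seq_a, seq_b):
    """Return (1-based position, residue_a, residue_b) for each mismatch.

    Only positions present in both sequences are compared; a length
    difference is not reported by this function.
    """
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


def compare_tables(pep, tx_pep):
    """Report transcripts present in both tables whose translations differ."""
    report = {}
    for name in sorted(pep.keys() & tx_pep.keys()):
        diffs = residue_differences(pep[name], tx_pep[name])
        if diffs or len(pep[name]) != len(tx_pep[name]):
            report[name] = diffs
    return report
```

With toy rows such as "tx1<TAB>RSQPGRYNRA" in one table and "tx1<TAB>RSQPGCYNRA" in the other (hypothetical fragments, not real table entries), compare_tables reports a single R/C mismatch at position 6 for tx1.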
> 
> I hope that makes more sense.
> 
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
> 
> 
> 
> On 7/24/12 10:04 AM, Evan Bai wrote:
>> Hi, Brooke,
>> 
>> Thanks again for the quick reply.
>> The knownGenePep data set is extremely useful.  However, I have found
>> some slight discrepancies between the peptide sequence stored in this
>> file and what is shown in the actual genome browser.  For example, at
>> hg19 genomic position *chr1:1,269,549-1,269,558*, the browser window
>> shows the peptide sequence for *uc010nyk.2* as *PGCY*.  However, in
>> the actual knownGenePep data table, the recorded sequence is:
>> "MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSPVCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKAGSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETFPSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHEGLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVASEAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQGLEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDPVKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRTERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTFCGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFVESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALVHCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQ*PGRY*NRARGLTFAMLAYFITWVSFVPLLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQNDGNTGNQGKHE"
>> 
>> I did notice that there is a C/T SNP (rs307377) at chr1:1,269,554.
>> When the nucleotide is C, the resulting AA would be R; whereas if the
>> nucleotide is T, the AA would be C. I know that these are very minor
>> differences, but they do introduce false positives in my analyses.  I
>> suspect that the browser version is the more recent one, and I would
>> really appreciate it if there were any way to download that version of
>> the knownGenePep file.
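The R versus C difference described above can be checked with the standard genetic code. This is a rough sketch only: the codon CGC is an assumed example (the thread does not give the actual codon), and it assumes the SNP falls in the first codon position on the coding strand. The helper names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_codon(codon):
    """Translate one DNA codon (coding strand) to a one-letter amino acid."""
    return CODON_TABLE[codon.upper()]


def substitute(codon, offset, base):
    """Return the codon with the base at 0-based `offset` replaced."""
    return codon[:offset] + base + codon[offset + 1:]


reference_like = "CGC"                        # assumed codon carrying the C allele
variant = substitute(reference_like, 0, "T")  # the same codon with the T allele
print(translate_codon(reference_like), translate_codon(variant))
```

Under these assumptions the C allele gives arginine (R) and the T allele gives cysteine (C), which matches the PGRY versus PGCY difference.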
>> 
>> Thank you so much!
>> Best,
>> Evan
>> 
>> 
>> On Jul 23, 2012, at 7:12 PM, Brooke Rhead wrote:
>> 
>>> Hi Evan,
>>> 
>>> The protein sequence for UCSC Genes is actually kept in the table
>>> knownGenePep.  (This is an odd case; we generally do not store
>>> sequence in tables for the Genome Browser.)  You can download the
>>> table from our downloads server:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGenePep.txt.gz
>>> or from the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
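Once the table is downloaded, it can be turned into a FASTA file with a few lines of Python. This is a minimal sketch, assuming each line of the decompressed table is a transcript ID and its peptide sequence separated by a tab (an assumption about the layout; check the table schema). The function names, the local file paths, and the 60-residue line width are illustrative choices, not part of any UCSC tool.

```python
import gzip


def pep_rows_to_fasta(rows, width=60):
    """Convert 'name<TAB>sequence' rows into FASTA-formatted text."""
    records = []
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue  # skip blank lines
        name, seq = row.split("\t", 1)
        # Break the sequence into lines of at most `width` residues.
        chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(chunks) + "\n")
    return "".join(records)


def convert_file(in_path, out_path):
    """Read a gzip-compressed table from a local path and write FASTA."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.write(pep_rows_to_fasta(src))
```

For example, convert_file("knownGenePep.txt.gz", "knownGenePep.fa") would write one FASTA record per row, with the transcript ID as the header line.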
>>> 
>>> Be aware that there is a second table, knownGeneTxPep, that contains a
>>> slightly different sequence for some of the peptides.  There is a
>>> description of the difference on the hg19 UCSC Genes description page
>>> (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
>>> 
>>> *knownGenePep* contains the protein sequences derived from the
>>> knownGeneMrna transcript sequences. Any protein-level annotations,
>>> such as the contents of the knownToPfam table, are based on these
>>> sequences.
>>> 
>>> *knownGeneTxPep* contains the protein translation (if any) of each
>>> mRNA sequence in knownGeneTxMrna.
>>> 
>>> I see another question from you with the same subject line from last
>>> week; I think this response answers both questions.  If not, or if you
>>> have further questions, please write back to us at [email protected].
>>> 
>>> --
>>> Brooke Rhead
>>> UCSC Genome Bioinformatics Group
>>> 
>>> 
>>> On 7/23/12 10:06 AM, Evan Bai wrote:
>>>> Hi,
>>>> 
>>>> I have a question regarding retrieving protein fasta sequences from
>>>> the genome browser.
>>>> 
>>>> For example, when I searched for "uc010nyk.2" in the browser, clicked
>>>> on the gene, and then clicked on "Protein (852 aa)", it led me to
>>>> this result:
>>>> >uc010nyk.2 (TAS1R3) length=852
>>>> MLGPAVLGLSLWALLHPGTGAPLCLSQQLRMKGDYVLGGLFPLGEAEEAGLRSRTRPSSP
>>>> VCTRFSSNGLLWALAMKMAVEEINNKSDLLPGLRLGYDLFDTCSEPVVAMKPSLMFLAKA
>>>> GSRDIAAYCNYTQYQPRVLAVIGPHSSELAMVTGKFFSFFLMPQVSYGASMELLSARETF
>>>> PSFFRTVPSDRVQLTAAAELLQEFGWNWVAALGSDDEYGRQGLSIFSALAAARGICIAHE
>>>> GLVPLPRADDSRLGKVQDVLHQVNQSSVQVVLLFASVHAAHALFNYSISSRLSPKVWVAS
>>>> EAWLTSDLVMGLPGMAQMGTVLGFLQRGAQLHEFPQYVKTHLALATDPAFCSALGEREQG
>>>> LEEDVVGQRCPQCDCITLQNVSAGLNHHQTFSVYAAVYSVAQALHNTLQCNASGCPAQDP
>>>> VKPWQLLENMYNLTFHVGGLPLRFDSSGNVDMEYDLKLWVWQGSVPRLHDVGRFNGSLRT
>>>> ERLKIRWHTSDNQKPVSRCSRQCQEGQVRRVKGFHSCCYDCVDCEAGSYRQNPDDIACTF
>>>> CGQDEWSPERSTRCFRRRSRFLAWGEPAVLLLLLLLSLALGLVLAALGLFVHHRDSPLVQ
>>>> ASGGPLACFGLVCLGLVCLSVLLFPGQPSPARCLAQQPLSHLPLTGCLSTLFLQAAEIFV
>>>> ESELPLSWADRLSGCLRGPWAWLVVLLAMLVEVALCTWYLVAFPPEVVTDWHMLPTEALV
>>>> HCRTRSWVSFGLAHATNATLAFLCFLGTFLVRSQPGRYNRARGLTFAMLAYFITWVSFVP
>>>> LLANVQVVLRPAVQMGALLLCVLGILAAFHLPRCYLLMRQPGLNTPEFFLGGGPGDAQGQ
>>>> NDGNTGNQGKHE
>>>> 
>>>> A fasta file with the protein sequences for uc010nyk.2
>>>> 
>>>> And I wonder how I can download the protein fasta file for all hg19
>>>> proteins with UCSC identifier.
>>>> 
>>>> thank you!
>>>> 
>>>> Sincerely,
>>>> Evan Bai
>>>> Yale University
>>>> _______________________________________________
>>>> Genome maillist  -  [email protected]
>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>> 
>> 


