Hello Haiwei,
You can get this information in a two-step process using the Table Browser
tool on our website ('Tables' in the top blue navigation bar). You will need
to repeat these two steps for each of the five species that you are
interested in.
STEP ONE. To get the gene name, protein ID, gene start and end positions on
the chromosome, and strandedness, follow these steps:
1. Navigate to the Table Browser and choose your organism.
2. Select the RefSeq Genes track (refGene table).
3. As 'output type' choose "selected fields from primary and related tables".
4. Then press the "get output" button.
5. From the next page, choose the following fields from the refGene table:
name, chrom, strand, txStart, txEnd
6. Scroll down, check the kgXref table, and press the "Allow Selection From
Checked Tables" button.
7. Scroll down and, from the kgXref table, choose the protAcc field.
8. Press the "get output" button.
The output will be a list of all of the RefSeq Genes for that
assembly/organism, each with its name, chrom, strand, transcription start and
end, and protein accession, like so:
#hg18.refGene.name hg18.refGene.chrom hg18.refGene.strand hg18.refGene.txStart hg18.refGene.txEnd hg18.kgXref.protAcc
NM_000808 chrX - 151086289 151370487 NP_000799
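If you would rather process this output with a script than by hand, here is a
minimal Python sketch. It assumes that you saved the output as tab-separated
text and that the header line looks like the one above; the function name
read_table_browser_output is just an illustration, not something provided by
the Table Browser.

```python
import csv
import io


def read_table_browser_output(handle):
    """Parse saved Table Browser output into a list of dicts (one per gene)."""
    # Assumption: columns are separated by tabs.
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # The header starts with '#', and each column is written as
    # 'database.table.field'; keep only the field name after the last dot.
    fields = [col.lstrip("#").rsplit(".", 1)[-1] for col in header]
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        row = dict(zip(fields, values))
        row["txStart"] = int(row["txStart"])
        row["txEnd"] = int(row["txEnd"])
        rows.append(row)
    return rows


# The example line from above, written with explicit tabs.
sample = (
    "#hg18.refGene.name\thg18.refGene.chrom\thg18.refGene.strand\t"
    "hg18.refGene.txStart\thg18.refGene.txEnd\thg18.kgXref.protAcc\n"
    "NM_000808\tchrX\t-\t151086289\t151370487\tNP_000799\n"
)
genes = read_table_browser_output(io.StringIO(sample))
print(genes[0])
```

For a real file, pass an open file handle instead of the io.StringIO sample.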
STEP TWO. To get the predicted protein for each of the RefSeq Genes, follow
these steps:
1. Navigate to the Table Browser and choose your organism.
2. Select the RefSeq Genes track (refGene table).
3. As 'output type' choose "sequence".
4. Then press the "get output" button.
5. From the next page, choose 'protein', then press the "submit" button.
The output will be a list of the protein sequences for all RefSeq Genes,
like so:
>NP_000799.1
MIITQTSHCYMTSLGILFLINILPGTTGQGESRRQEPGDFVKQDIGGLSP
KHAPDIPDDSTDNITIFTRILDRLLDGYDNRLRPGLGDAVTEVKTDIYVT
SFGPVSDTDMEYTIDVFFRQTWHDERLKFDGPMKILPLNNLLASKIWTPD
TFFHNGKKSVAHNMTTPNKLLRLVDNGTLLYTMRLTIHAECPMHLEDFPM
DVHACPLKFGSYAYTTAEVVYSWTLGKNKSVEVAQDGSRLNQYDLLGHVV
GTEIIRSSTGEYVVMTTHFHLKRKIGYFVIQTYLPCIMTVILSQVSFWLN
RESVPARTVFGVTTVLTMTTLSISARNSLPKVAYATAMDWFIAVCYAFVF
SALIEFATVNYFTKRSWAWEGKKVPEALEMKKKTPAAPAKKTSTTFNIVG
TTYPINLAKDTEFSTISKGAAPSASSTPTIIASPKATYVQDSPTETKTYN
SVSKVDKISRIIFPVLFAIFNLVYWATYVNRESAIKGMIRKQ
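You can relate the RefSeq Gene names to the proper protein names by reviewing
the output from STEP ONE. Note that in the examples above the sequence header
carries a version suffix (NP_000799.1), while the protAcc field from STEP ONE
does not (NP_000799).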
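If you want to do this matching with a script, here is a rough Python sketch.
It assumes that the sequence output is in FASTA format as shown above, that
the first word of each '>' line is the protein accession, and that dropping
the version suffix (the part after the dot) is enough to match the protAcc
field. The helper names and the shortened sample sequence are only for
illustration.

```python
import io


def read_fasta(handle):
    """Yield (accession, sequence) pairs from FASTA-formatted text."""
    acc, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            # Assumption: the first word of the header is the accession.
            acc, chunks = line[1:].split()[0], []
        elif acc is not None:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)


def strip_version(accession):
    """Turn 'NP_000799.1' into 'NP_000799'."""
    return accession.split(".", 1)[0]


# One row from STEP ONE, already parsed into a dict.
step_one = [
    {"name": "NM_000808", "chrom": "chrX", "strand": "-",
     "txStart": 151086289, "txEnd": 151370487, "protAcc": "NP_000799"},
]

# First two lines only of the example sequence (shortened for this sketch).
fasta_text = (
    ">NP_000799.1\n"
    "MIITQTSHCYMTSLGILFLINILPGTTGQGESRRQEPGDFVKQDIGGLSP\n"
    "KHAPDIPDDSTDNITIFTRILDRLLDGYDNRLRPGLGDAVTEVKTDIYVT\n"
)

# Index the sequences by accession without the version suffix.
proteins = {
    strip_version(acc): seq
    for acc, seq in read_fasta(io.StringIO(fasta_text))
}

# Attach the sequence to each gene; None if no sequence was found.
annotated = [dict(row, protein=proteins.get(row["protAcc"])) for row in step_one]
print(annotated[0]["name"], annotated[0]["protAcc"], len(annotated[0]["protein"]))
```

Genes whose protAcc field is empty or has no matching sequence end up with
protein set to None, so you can filter those out afterwards.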
If you need help getting started with the Table Browser, please visit the
User's Guide: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
I hope this information is helpful to you. Please don't hesitate to contact
the mail list again if you require further assistance.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
Please feel free to search the Genome mailing list archives by visiting our
home page, clicking on "Contact Us", then typing a word or phrase into the
search box. On that same page (http://genome.ucsc.edu/contacts.html), you can
subscribe to the Genome mailing list.
Haiwei Luo wrote:
> Dear colleagues,
>
> I am a graduate student in University of South Carolina. In my research
> project, I need UCSC genome annotations of the following five species: Homo
> sapiens, Mus musculus, Drosophila melanogaster, Anopheles gambiae, and
> Caenorhabditis elegans. Genome annotations may contain predicted protein
> sequences, protein ID, gene starting, ending position on the chromosome, and
> strandedness. I don't know where I can find those files. Could you kindly
> send me links to locate those files?
>
> Sincere thanks,
> Haiwei Luo
> _______________________________________________
> Genome maillist - [email protected]
> http://www.soe.ucsc.edu/mailman/listinfo/genome