Dear UCSC Genome, 

I'm having some trouble making sense of genome coordinates in a "known genes" 
file that I downloaded from your site. I have genomic sequences from some 
cancer patients with information about which mutations are specific to the 
cancer. The information comes as, for example, "there is a missense A to T 
mutation at base pair 8894776 on chromosome 7 for the gene with Entrez ID 
5426". I want to know what that implies for the protein in question, so I find 
which uc identifiers correspond to Entrez ID 5426, make a list of those 
identifiers, and submit it through your "Tables" function, asking for genomic 
sequence with coordinates from human genome build 37. I get back a FASTA 
file with entries like this: 

>hg19_knownGene_uc001abw.1 range=chr1:861121-879961 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG
TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg
ggagggggtactggccctgccgctgactgcgcgcagaagcgtgccgctcc
ctcacagggtctgcctcggctctgctcgcagGGAAAAGTCTGAAGACGCT
TATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGG    ... etc. 

I have found the coordinates and sequences listed in these FASTA files to be 
100% reliable: when my data say that, for example, base pair 8894776 on 
chromosome 7 is an A, my FASTA file always confirms it, and it shows the base 
as part of an exon if it is in capitals in the returned file or part of an 
intron if it is in lowercase. 
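
The check described above could be sketched roughly as follows. This is only 
an illustration: the function names parse_range and base_at are made up for 
this message, it assumes the range= field in the header is 1-based and 
inclusive (which is what matches my data), and it only handles + strand 
entries. The sequence is just the first 100 bases of the entry quoted above. 

```python
import re

def parse_range(header):
    """Return (chrom, start, end) taken from the range= field of a FASTA header."""
    m = re.search(r"range=([^:\s]+):(\d+)-(\d+)", header)
    if m is None:
        raise ValueError("no range= field found in header")
    return m.group(1), int(m.group(2)), int(m.group(3))

def base_at(header, seq, pos):
    """Look up the base at genomic position pos and classify it by letter case.

    Assumption: range= is 1-based inclusive and the entry is on the + strand,
    so the first letter of seq sits at genomic position `start`.
    """
    _, start, end = parse_range(header)
    if not start <= pos <= end:
        raise ValueError("position lies outside the range given in the header")
    offset = pos - start
    if offset >= len(seq):
        raise ValueError("sequence supplied is shorter than the header range")
    base = seq[offset]
    # Capitals mark exon bases, lowercase marks intron bases.
    return base, ("exon" if base.isupper() else "intron")

# Header and first two sequence lines of the entry shown earlier.
header = (">hg19_knownGene_uc001abw.1 range=chr1:861121-879961 "
          "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
seq = ("GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG"
       "TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg")

first = base_at(header, seq, 861121)   # first base of the entry
later = base_at(header, seq, 861181)   # 61st base of the entry
```

For a - strand entry the returned sequence is reverse-complemented, so the 
offset arithmetic above would have to be mirrored. 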

I then want to find the protein sequence to figure out the consequence of the 
mutation, and that should be easy to automate based on downloading your file 
"knownGene.txt", whose format is described here: 

http://genome.ucsc.edu/cgi-bin/hgTables

Here I run into trouble. The numbering system in this file is inconsistent. 
Sticking just with genes on the + strand (to rule out me making a mistake based 
on being confused by reverse numbering on the - strand), I find that sometimes 
the first base of the first exon will be listed as starting at the same 
coordinate as the first base of the first exon of my FASTA sequences, but 
sometimes they will be listed as being one coordinate less than the first base 
of my first exon. Similarly, I will find that the first coordinate in the 
coding sequence will be listed as the coordinate corresponding to the first 
base in my FASTA exon or one less than that. Furthermore, the coding sequence 
start will sometimes be listed as the true coding position start (i.e. the 
starting ATG), but sometimes it will be listed as the start of the first exon 
even though the coding region doesn't start until well into the gene (and 
again, with these internal coding sequences, I find problems with being off by 
1 sometimes but not always). Can you suggest how I should proceed to make 
this all make sense? 
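
To make the comparison concrete, here is a minimal sketch of how the offsets 
between a knownGene.txt row and a FASTA header range could be tabulated. It 
assumes the row is tab-separated with the leading columns name, chrom, strand, 
txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds. The names 
KnownGeneRow, parse_known_gene_line and compare_to_fasta are hypothetical, and 
the example row and range are made-up toy values, not real data. 

```python
from typing import List, NamedTuple

class KnownGeneRow(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]

def parse_known_gene_line(line):
    """Parse one tab-separated row (column order is an assumption, see above)."""
    f = line.rstrip("\n").split("\t")
    exon_count = int(f[7])
    # exonStarts / exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in f[8].split(",") if x]
    ends = [int(x) for x in f[9].split(",") if x]
    if len(starts) != exon_count or len(ends) != exon_count:
        raise ValueError("exonCount does not match the exon lists")
    return KnownGeneRow(f[0], f[1], f[2], int(f[3]), int(f[4]),
                        int(f[5]), int(f[6]), starts, ends)

def compare_to_fasta(row, fasta_start, fasta_end):
    """Report how the row's coordinates differ from the FASTA header range."""
    return {
        "tx_start_offset": fasta_start - row.tx_start,
        "first_exon_offset": fasta_start - row.exon_starts[0],
        "tx_end_offset": fasta_end - row.tx_end,
        # True when the coding start is listed at the transcript start.
        "cds_at_tx_start": row.cds_start == row.tx_start,
    }

# Toy row and toy FASTA range, invented purely for illustration.
toy_line = "ucHYPO.1\tchr1\t+\t100\t200\t120\t180\t2\t100,150,\t130,200,"
toy_row = parse_known_gene_line(toy_line)
report = compare_to_fasta(toy_row, fasta_start=101, fasta_end=200)
```

Running this over every + strand transcript and counting how often each 
offset value occurs would show whether the difference of 1 is systematic or 
really varies from gene to gene. 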

Thank you very much. 

Sincerely, 

Eric
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
