Re: [Genome] problem with genome coordinates in "KnownGene" files

Katrina Learned Fri, 11 Mar 2011 11:48:21 -0800

Hi Eric,

Please see this FAQ from our web site:
http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
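
A minimal sketch of the convention that FAQ describes, as I understand it: positions stored in UCSC database tables such as knownGene use a 0-based start and a 1-based end (a half-open interval), while positions shown in the browser and in FASTA headers (the range= field) are 1-based and fully closed. So a table start is expected to be one less than the matching display start, and the ends agree. The function names below are made up for illustration and are not part of any UCSC tool.

```python
def table_to_display(start0, end):
    """Convert a 0-based, half-open table interval to a 1-based, closed range."""
    return start0 + 1, end


def display_to_table(start1, end):
    """Convert a 1-based, closed display range to a 0-based, half-open interval."""
    return start1 - 1, end


# The FASTA header quoted in this thread gives range=chr1:861121-879961.
# Under the convention above, the table row for the same transcript would be
# expected to store the start as one less and the end unchanged.
tx_start, tx_end = display_to_table(861121, 879961)
print(tx_start, tx_end)

# With half-open coordinates the length is simply end - start.
print(tx_end - tx_start)
```

A start that looks "one coordinate less" than the FASTA start is therefore what this convention predicts, not an error in the file.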


I hope this is helpful. Please don't hesitate to contact the mail list 
again if you have any further questions.

Katrina Learned
UCSC Genome Bioinformatics Group

Eric Foss wrote, On 03/11/11 11:14:
> Dear UCSC Genome, 
>
> I'm having some trouble making sense of genome coordinates in a "known genes" 
> file that I downloaded from your site. I have genomic sequences from some 
> cancer patients with information about which mutations are specific to the 
> cancer. The information comes as, for example, "there is a missense A to T 
> mutation at base pair 8894776 on chromosome 7 for the gene with Entrez ID 
> 5426". I want to know what that implies for the protein in question, so I 
> find which uc identifier corresponds to Entrez ID 5426, make a list of those 
> identifiers and submit it through your "Tables" function asking for genomic 
> sequence with coordinates from human genome build 37 and I get back a FASTA 
> file with entries like this: 
>
> >hg19_knownGene_uc001abw.1 range=chr1:861121-879961 5'pad=0 3'pad=0 strand=+ repeatMasking=none
> GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG
> TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg
> ggagggggtactggccctgccgctgactgcgcgcagaagcgtgccgctcc
> ctcacagggtctgcctcggctctgctcgcagGGAAAAGTCTGAAGACGCT
> TATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGG    ... etc. 
>
> I have found the coordinates and sequences listed in these FASTA files to be 
> 100% reliable, meaning that when my data say that, for example,  base pair 
> 8894776 on chromosome 7 is an A, then my FASTA file always confirms the same 
> thing, confirms that it is an exon if it's in capitals in the returned file 
> or an intron if it's in small letters, etc. 
>
> I then want to find the protein sequence to figure out the consequence of the 
> mutation, and that should be easy to automate based on downloading your file 
> "knownGene.txt", whose format is described here: 
>
> http://genome.ucsc.edu/cgi-bin/hgTables
>
> Here I run into trouble. The numbering system in this file is inconsistent. 
> Sticking just with genes on the + strand (to rule out me making a mistake 
> based on being confused by reverse numbering on the - strand), I find that 
> sometimes the first base of the first exon will be listed as starting at the 
> same coordinate as the first base of the first exon of my FASTA sequences, 
> but sometimes they will be listed as being one coordinate less than the first 
> base of my first exon. Similarly, I will find that the first coordinate in 
> the coding sequence will be listed as the coordinate corresponding to the 
> first base in my FASTA exon or one less than that. Furthermore, the coding 
> sequence start will sometimes be listed as the true coding position start 
> (i.e. the starting ATG), but sometimes it will be listed as the start of the 
> first exon even though the coding region doesn't start until well into the 
> gene (and again, with these internal coding sequences, I find problems with 
> being 1 off sometimes but not always). Can you suggest how I should proceed to make 
> this all make sense? 
>
> Thank you very much. 
>
> Sincerely, 
>
> Eric
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   
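
For the automation step described in the quoted message, here is a hedged sketch of mapping a 1-based genomic position onto a coding-sequence offset from one knownGene.txt row. It assumes the tab-separated column order name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, with all starts 0-based and all ends 1-based, and it handles the + strand only. It also assumes that a row with cdsStart equal to cdsEnd has no annotated coding region. The row used here is invented for illustration and is not real data; parse_known_gene and cds_offset are hypothetical helper names.

```python
def parse_known_gene(line):
    """Parse one tab-separated knownGene-style row (assumed column order)."""
    f = line.rstrip("\n").split("\t")
    starts = [int(x) for x in f[8].split(",") if x]  # trailing comma tolerated
    ends = [int(x) for x in f[9].split(",") if x]
    return {
        "name": f[0], "chrom": f[1], "strand": f[2],
        "txStart": int(f[3]), "txEnd": int(f[4]),
        "cdsStart": int(f[5]), "cdsEnd": int(f[6]),
        "exons": list(zip(starts, ends)),
    }


def cds_offset(gene, pos1):
    """Return the 0-based offset of a 1-based genomic position within the
    coding sequence, or None if the position is not coding. + strand only."""
    if gene["strand"] != "+":
        raise ValueError("this sketch handles the + strand only")
    if gene["cdsStart"] == gene["cdsEnd"]:
        return None  # assumed to mean: no annotated coding region
    pos0 = pos1 - 1  # convert to the 0-based system used by the table
    offset = 0
    for start, end in gene["exons"]:
        lo = max(start, gene["cdsStart"])  # clip each exon to the CDS
        hi = min(end, gene["cdsEnd"])
        if lo >= hi:
            continue  # exon lies entirely in a UTR
        if lo <= pos0 < hi:
            return offset + (pos0 - lo)
        offset += hi - lo
    return None


# Invented row: two exons, [100,200) and [300,400), CDS [150,350).
row = "ucHYPO.1\tchrN\t+\t100\t400\t150\t350\t2\t100,300,\t200,400,\t\tucHYPO.1"
gene = parse_known_gene(row)
off = cds_offset(gene, 301)
print(off, off // 3, off % 3)  # CDS offset, 0-based codon index, codon phase
```

The codon index and phase from the last line are what one would use to look up the affected amino acid; genes on the - strand need the exon order and the bases reversed and complemented, which this sketch deliberately leaves out.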

