Dear UCSC Genome,

I'm having trouble making sense of the genome coordinates in a "known genes" file that I downloaded from your site. I have genomic sequences from some cancer patients, along with information about which mutations are specific to the cancer. The information comes as, for example, "there is a missense A to T mutation at base pair 8894776 on chromosome 7 for the gene with Entrez ID 5426". I want to know what that implies for the protein in question, so I find which uc identifiers correspond to Entrez ID 5426, make a list of those identifiers, and submit it through your "Tables" function, asking for genomic sequence with coordinates from human genome build 37. I get back a FASTA file with entries like this:

>hg19_knownGene_uc001abw.1 range=chr1:861121-879961 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG
TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg
ggagggggtactggccctgccgctgactgcgcgcagaagcgtgccgctcc
ctcacagggtctgcctcggctctgctcgcagGGAAAAGTCTGAAGACGCT
TATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGG
... etc.

I have found the coordinates and sequences listed in these FASTA files to be 100% reliable. When my data say that, for example, base pair 8894776 on chromosome 7 is an A, my FASTA file always confirms the same thing. It also confirms that the base is in an exon if it is in capitals in the returned file, or in an intron if it is in small letters, etc.

I then want to find the protein sequence to figure out the consequence of the mutation. That should be easy to automate by downloading your file "knownGene.txt", whose format is described here: http://genome.ucsc.edu/cgi-bin/hgTables

Here I run into trouble. The numbering system in this file is inconsistent. Sticking just with genes on the + strand (to rule out a mistake on my part caused by the reverse numbering on the - strand), I find that sometimes the first base of the first exon is listed at the same coordinate as the first base of the first exon in my FASTA sequences, but sometimes it is listed at one coordinate less. Similarly, the first coordinate of the coding sequence is sometimes listed as the coordinate of the corresponding base in my FASTA exon and sometimes as one less than that. Furthermore, the coding sequence start is sometimes listed as the true coding start (i.e. the starting ATG), but sometimes it is listed as the start of the first exon, even though the coding region doesn't start until well into the gene. With these internal coding sequences I again find that the coordinates are off by 1 sometimes, but not always.

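For concreteness, the FASTA-side lookup described above could be sketched roughly as follows. This is only an illustration, not the actual script I use: the function names are hypothetical, and it assumes that the range= field in the header is 1-based and inclusive, that the gene is on the + strand, and that the sequence lines have already been joined into one string.

```python
import re

# Pulls chromosome, start, end and strand out of a header such as
# ">hg19_knownGene_uc001abw.1 range=chr1:861121-879961 ... strand=+ ..."
HEADER_RE = re.compile(r"range=(\w+):(\d+)-(\d+).*?strand=([+-])")


def parse_header(header):
    """Return (chrom, start, end, strand) from a FASTA header line."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError("no range=/strand= fields found in header")
    chrom, start, end, strand = match.groups()
    return chrom, int(start), int(end), strand


def base_at(header, sequence, position):
    """Return (base, 'exon' or 'intron') for a genomic position.

    Assumption: range= is 1-based and inclusive, so the first letter
    of the sequence sits at the range start. Only + strand is handled.
    """
    chrom, start, end, strand = parse_header(header)
    if strand != "+":
        raise NotImplementedError("this sketch only handles the + strand")
    if not start <= position <= end:
        raise ValueError("position lies outside the header range")
    offset = position - start
    if offset >= len(sequence):
        raise ValueError("sequence is shorter than the header range")
    base = sequence[offset]
    # Capitals mark exons and small letters mark introns in the returned file.
    return base.upper(), "exon" if base.isupper() else "intron"


header = (">hg19_knownGene_uc001abw.1 range=chr1:861121-879961 "
          "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
# First two sequence lines of the entry shown above, joined together.
sequence = ("GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG"
            "TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg")

print(base_at(header, sequence, 861121))  # ('G', 'exon')
print(base_at(header, sequence, 861181))  # ('G', 'intron')
```

It is the same kind of comparison against the exon and coding start columns of knownGene.txt that gives me the mismatches.
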
Can you suggest how I should proceed to make this all make sense?

Thank you very much.

Sincerely,
Eric
