Hi Eric,

Please see this FAQ from our web site:
http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
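In short, the FAQ explains that our database tables (knownGene.txt included) store start coordinates 0-based and end coordinates 1-based ("half-open"), while the browser display and the range= field in the FASTA headers are 1-based at both ends. That by itself can produce an apparent off-by-one on starts but not on ends. Below is a rough Python sketch of the conversion, not a definitive implementation. The helper names and the example row are made up for illustration, the column order assumed is the hg19 knownGene.txt layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), and only the + strand is handled. It also assumes that a non-coding entry is marked by cdsStart equal to cdsEnd, which would account for a "coding start" that does not sit on an ATG.

```python
def parse_known_gene_line(line):
    """Parse one tab-separated knownGene.txt row (hg19 column order assumed)."""
    f = line.rstrip("\n").split("\t")
    return {
        "name": f[0], "chrom": f[1], "strand": f[2],
        "txStart": int(f[3]), "txEnd": int(f[4]),
        "cdsStart": int(f[5]), "cdsEnd": int(f[6]),
        # exonStarts/exonEnds are comma-separated lists with a trailing comma
        "exonStarts": [int(x) for x in f[8].split(",") if x],
        "exonEnds": [int(x) for x in f[9].split(",") if x],
    }


def table_to_display(start0, end):
    """Table coordinates (0-based start, half-open) -> 1-based closed range."""
    return start0 + 1, end


def cds_offset(gene, pos1):
    """0-based offset of a 1-based genomic position within the spliced CDS.

    Returns None if the position is not in a coding exon. Plus strand only.
    """
    if gene["strand"] != "+":
        raise ValueError("this sketch only handles the + strand")
    if gene["cdsStart"] == gene["cdsEnd"]:
        return None  # assumption: equal values mean no coding region
    pos0 = pos1 - 1  # convert the 1-based position to table-style 0-based
    offset = 0
    for start, end in zip(gene["exonStarts"], gene["exonEnds"]):
        # clip each exon to the coding region
        start, end = max(start, gene["cdsStart"]), min(end, gene["cdsEnd"])
        if start >= end:
            continue  # exon lies entirely in a UTR
        if start <= pos0 < end:
            return offset + (pos0 - start)
        offset += end - start
    return None


# Hypothetical row, NOT real data: two exons, CDS starting inside exon 1.
row = "ucHypo.1\tchr1\t+\t999\t2000\t1049\t1900\t2\t999,1500,\t1100,2000,\tP00000\tucHypo.1"
gene = parse_known_gene_line(row)

print(table_to_display(gene["txStart"], gene["txEnd"]))  # (1000, 2000)
off = cds_offset(gene, 1501)  # first base of exon 2 in 1-based numbering
print(off, off // 3, off % 3)  # CDS offset, 0-based codon index, base in codon
```

By the same arithmetic, a table txStart of 861120 with txEnd 879961 would correspond to the range=chr1:861121-879961 shown in the FASTA header in your message.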
I hope this is helpful. Please don't hesitate to contact the mail list
again if you have any further questions.

Katrina Learned
UCSC Genome Bioinformatics Group

Eric Foss wrote, On 03/11/11 11:14:
> Dear UCSC Genome,
>
> I'm having some trouble making sense of genome coordinates in a "known genes"
> file that I downloaded from your site. I have genomic sequences from some
> cancer patients with information about which mutations are specific to the
> cancer. The information comes as, for example, "there is a missense A to T
> mutation at base pair 8894776 on chromosome 7 for the gene with Entrez ID
> 5426". I want to know what that implies for the protein in question, so I
> find which uc identifier corresponds to Entrez ID 5426, make a list of those
> identifiers and submit it through your "Tables" function asking for genomic
> sequence with coordinates from human genome build 37 and I get back a FASTA
> file with entries like this:
>
> >hg19_knownGene_uc001abw.1 range=chr1:861121-879961 5'pad=0 3'pad=0 strand=+ repeatMasking=none
> GCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGCTGGGAAG
> TCGGGCCGAGgtgggtgtggggttcggggtgtatttcgtccacgagccgg
> ggagggggtactggccctgccgctgactgcgcgcagaagcgtgccgctcc
> ctcacagggtctgcctcggctctgctcgcagGGAAAAGTCTGAAGACGCT
> TATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGG ... etc.
>
> I have found the coordinates and sequences listed in these FASTA files to be
> 100% reliable, meaning that when my data say that, for example, base pair
> 8894776 on chromosome 7 is an A, then my FASTA file always confirms the same
> thing, confirms that it is an exon if it's in capitals in the returned file
> or an intron if it's in small letters, etc.
>
> I then want to find the protein sequence to figure out the consequence of the
> mutation, and that should be easy to automate based on downloading your file
> "knownGene.txt", whose format is described here:
>
> http://genome.ucsc.edu/cgi-bin/hgTables
>
> Here I run into trouble. The numbering system in this file is inconsistent.
> Sticking just with genes on the + strand (to rule out me making a mistake
> based on being confused by reverse numbering on the - strand), I find that
> sometimes the first base of the first exon will be listed as starting at the
> same coordinate as the first base of the first exon of my FASTA sequences,
> but sometimes they will be listed as being one coordinate less than the first
> base of my first exon. Similarly, I will find that the first coordinate in
> the coding sequence will be listed as the coordinate corresponding to the
> first base in my FASTA exon or one less than that. Furthermore, the coding
> sequence start will sometimes be listed as the true coding position start
> (i.e. the starting ATG), but sometimes it will be listed as the start of the
> first exon even though the coding region doesn't start until well into the
> gene (and again, with these internal coding sequences, I find problems with
> being 1 off sometimes but not always). Can you suggest how I should proceed
> to make this all make sense?
>
> Thank you very much.
>
> Sincerely,
>
> Eric
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
