Hi Aaron,

Thank you for sending your examples. I do not see the discrepancy between the refFlat file, the chrom fasta sequences, and what is displayed in the browser for each.

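To make the coordinate handling concrete, here is a minimal Python sketch of how exon sequences could be pulled straight from a chromosome sequence using refFlat coordinates, with no offset for "N" runs. It is only an illustration, not a UCSC tool. It assumes the usual refFlat column order (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds), zero-based starts with exclusive ends, and a chromosome already loaded as one string with the "N" bases left in place. The function names and the toy chromosome and record are hypothetical and made up for the example.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_refflat_line(line):
    """Parse one tab-separated refFlat line (assumed standard column order)."""
    fields = line.rstrip("\n").split("\t")
    # exonStarts / exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in fields[9].split(",") if x]
    ends = [int(x) for x in fields[10].split(",") if x]
    return {
        "gene": fields[0],
        "name": fields[1],
        "chrom": fields[2],
        "strand": fields[3],
        "exons": list(zip(starts, ends)),
    }


def exon_sequences(record, chrom_seq):
    """Extract exon sequences in transcript order.

    Starts are zero-based and ends are exclusive, so they can be used
    directly as Python slice bounds. No adjustment is made for "N" runs:
    they are part of the chromosome sequence and count toward positions.
    """
    seqs = [chrom_seq[start:end] for start, end in record["exons"]]
    if record["strand"] == "-":
        # Coordinates are always given on the forward strand; for a
        # negative-strand gene, reverse the exon order and reverse-complement.
        seqs = [reverse_complement(s) for s in reversed(seqs)]
    return seqs


# Toy data (invented): 5 leading N bases followed by real sequence.
toy_chrom = "NNNNNACGTACGGTTTAACC"
toy_line = "GeneX\tNM_TOY\tchrToy\t-\t5\t16\t5\t16\t2\t5,12,\t9,16,\n"

record = parse_refflat_line(toy_line)
for seq in exon_sequences(record, toy_chrom):
    print(record["name"], seq)
```

With a real assembly, the same slicing would apply to the sequence read from the per-chromosome fasta file once the header line and line breaks have been removed.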
In Mouse July 2007 (mm9, NCBI Build 37), bases 1-3,000,000 of chr1 are annotated as "N". This represents the (estimated) telomeric region. The Table Browser can be used to determine the exact length for any given assembly. The links in this previous thread provide instructions for how to do this:
http://www.soe.ucsc.edu/pipermail/genome/2008-July/016798.html

The coordinates for your mouse RefSeq example sequence, NM_013715, are identical between the mm9 genome assembly browser and the refFlat table/file. Your example from Chicken is also consistent. You do not need to adjust alignment coordinates to compensate for any "N" regions in the base chromosome.

The only coordinate adjustments you may require are:

1) to interpret alignments on the negative strand correctly. Here is a link to a thread explaining:
   http://www.soe.ucsc.edu/pipermail/genome/2007-September/014688.html

2) to interpret the zero-based start coordinate correctly. Here is a link to our FAQ explaining:
   http://genome.ucsc.edu/FAQ/FAQtracks#tracks1

I hope this helps to clarify the data. Please let us know if you need any additional help/information or if I misinterpreted your question,

Jennifer Jackson
UCSC Genome Bioinformatics Group

Aaron Skewes wrote:
> Hi,
>
> I am attempting to extract the nucleotide sequences for exons in several
> genomes based on their locations listed in the refFlat.txt. In almost all
> cases, the exonStarts-exonEnds do not correspond to the nucleotide position
> relative to the refSeq for that particular organism and chromosome. For
> example, mouse build37 has a 30Mbp gap at the start of all chromosomes,
> except for Y. This gap is shown in the sequence with "N" but that is omitted
> from the refFlat table. In other words, nucleotide position 30x10^6 + 1 =
> position 0 in the refFlat. In chicken (and others), there are gaps
> interspersed throughout many of the assembled chromosomes, shown with "N",
> but refFlat locations are not offset by the gap lengths.
>
> Can somebody please suggest to me how I can extract genomic features based
> on nucleotide position programmatically, if the refFlat positions do not
> match the nucleotide positions and the offsets are unknown?
>
> Thank you,
>
> Aaron
>
> _______________________________________________
> Genome maillist - [email protected]
> http://www.soe.ucsc.edu/mailman/listinfo/genome
