Hello Alexandra,

It seems there is some confusion about what the alignment blocks (exons) represent versus the coordinates for txStart/txEnd and cdsStart/cdsEnd.
The exons are the regions of the entire transcript that align to the genome. This includes the 5' UTR, CDS, and 3' UTR. The first exon starts at position 1467732 + 1 = 1467733.

*Note about adding 1 to the start: you need to add 1 to convert the 0-based, half-open coordinates (which is how UCSC stores coordinates in the MySQL tables and most files) to 1-based, fully-closed coordinates (which is what the display shows).

The cdsStart is contained within this first exon (block), starting at position 1467752 + 1 = 1467753. For this exon, the first part is 5' UTR and the second part is CDS (coding). This can be seen in the graphical display in the browser: zoom in close to the position and notice the thin and thick display. Thin represents non-coding, thick represents coding.

If you choose to export "CDS" data from the Table Browser, in any format (including FASTA, in batch), it will be limited to the portion of the transcript defined by the CDS. For a quick way to get the protein sequence per transcript, locate the sequence in the Browser assembly viewer, click on the sequence, scroll down a bit on the description page and use the "Links to Sequences: -> Predicted Protein".

Some help links:
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#GeneDisplay
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
http://genome.ucsc.edu/FAQ/FAQformat.html#format9
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#Sequence

If you have any follow-up questions, we would be glad to offer more assistance.

Jen

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/28/10 12:40 AM, Rapoport Alexandra wrote:
> Greetings!
> Debugging my script I found the following annotation, and I think there is a
> problem:
>
> C.elegans, assembly ce6, CHR
> refGene table entry:
>
> #bin name chrom strand txStart txEnd cdsStart cdsEnd
> exonCount exonStarts exonEnds score name2 cdsStartStat
> cdsEndStat exonFrames
> 596 NM_061576 chrII + 1467732 1469600 1467752 1469560 7
> 1467732,1468089,1468422,1468750,1468880,1468999,1469128,
> 1468040,1468373,1468542,1468830,1468953,1469069,1469600,
>
> Through genome browser the first codon starts at position 1467752 and not
> at 1467732 (as in database). If I use the data from database, it is
> impossible to get the right protein sequence.
> Is there any possibility to handle such cases automatically (I mean by some
> script and not by looking at the result)? Actually, I find this one by
> accident :)
> Sincerely Yours,
> Alexandra Rapoport
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
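
As a rough illustration of the coordinate arithmetic described in the reply above, the following Python sketch clips the exon blocks of the quoted refGene row to the cdsStart/cdsEnd range and converts the result to 1-based display coordinates. The function names (parse_coords, cds_blocks, to_one_based) are hypothetical helpers made up for this example and are not part of any UCSC tool. The sketch only handles coordinates; fetching the sequence, and reverse-complementing it for minus-strand genes, is left out.

```python
def parse_coords(field):
    """Parse a UCSC comma-terminated coordinate list such as '1,2,3,'."""
    return [int(x) for x in field.split(",") if x]


def cds_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the coding range.

    All inputs and outputs are 0-based, half-open, as stored in the table.
    Exons lying entirely in a UTR are dropped.
    """
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


def to_one_based(start, end):
    """Convert 0-based, half-open to 1-based, fully-closed (browser display)."""
    return start + 1, end


# Values copied from the refGene row for NM_061576 quoted in this thread.
exon_starts = parse_coords("1467732,1468089,1468422,1468750,1468880,1468999,1469128,")
exon_ends = parse_coords("1468040,1468373,1468542,1468830,1468953,1469069,1469600,")
cds_start, cds_end = 1467752, 1469560

coding = cds_blocks(exon_starts, exon_ends, cds_start, cds_end)
first_coding_display = to_one_based(*coding[0])

# Plus-strand gene: the part of the first exon before cdsStart is 5' UTR.
utr5_in_first_exon = cds_start - exon_starts[0]

print(first_coding_display)   # coding part of the first exon, display coordinates
print(utr5_in_first_exon)     # number of 5' UTR bases in the first exon
```

Building the protein from these clipped blocks, rather than from the full exon blocks, avoids translating the 5' UTR bases at the start of the first exon.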
