Hi Pauline and Mary,

Thank you very much for the information about start positions being 0-based while end positions are 1-based. I made that adjustment, and the translation for that gene (uc010nya.1) came out exactly the same as the protein sequence given in the UCSC database for that gene.

My question is this: when I did the translation in the same manner (treating CDS start as the codon/translation start and CDS end as the codon end, with 0-based starts and 1-based ends), the results for some genes/transcripts did not match their protein sequences in the UCSC database, for example uc001adj.1, uc001ail.2, uc009vjq.2, etc.

Again, I used the chr1.fa.masked file downloaded from the UCSC Genome Browser FTP downloads for hg19 to parse the exon boundaries. I then joined the exon sequences and began the translation reading frame at CDS start, continuing to CDS end. Below is an example of part of the translated sequence for uc001ail.2 (beginning from CDS start) and its protein sequence obtained from the UCSC Genome Browser.

> uc001ail.2 partially translated sequence beginning from CDS start
LGASQHLGYKDDLVGLLHIPP*LQQGRHHEWVVWIKVRRWHPGDADGLRLAALHGAPGGL
NR

> uc001ail.2 protein sequence from UCSC database
MGTGVASMITCSIEGSVLNMGYVIAGESVSSGFKLQNNSLLPIKFSMHLDSLSSTRGRGQ
QQLPQFLSSPSQRTEVVGTQNLNGQSVFSVAPVKGVMDPGKTQDFTVTFSPDHESLYFSD
KLQVVLFEKKISHQILLKGAACQHMMFVEGGDPLDVPVESLTAIPVFDPRHREASSRPGP
LSPEAEELRPILVTLDYIQFDTDTPAPPATRELQVGCIRTTQPSPKKTVEFSIDSVASLQ
HKGFSIEPSRGSVERGQTKTISISWVPPADFDPDHPLMVSALLQLRGDVKETYKVIFVAQ
VLTGP

Further, for the genes/transcripts whose translation results I checked, if the codon at CDS start is ATG (i.e. corresponding to M), the translation results match (align 100% with) their protein sequences in the UCSC database, but for the other genes the translation result does not match the UCSC database.

It would be helpful if you could let me know why the translation results for those genes/transcripts differ from their protein sequences in the UCSC database.

Thank you,
Rahil Sethi

> Hello jlw,
>
> Looking at the two protein sequences sent in the previous question in
> this thread, they seem to diverge right before the end of the first exon,
> so I wonder if your program isn't parsing the exon/intron boundaries
> correctly?
>
> Another issue which may affect coordinate calculation: does your
> software take into account UCSC's 0-based start and 1-based end
> coordinate system? Please see this FAQ for more information:
>
> http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
>
> Hopefully this information was helpful and answers your question.
> If you have further questions or require clarification, feel free to
> contact the mailing list at [email protected].
>
> Best regards,
>
> Pauline Fujita
>
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
> On 10/13/10 5:09 AM, James Lyons-Weiler wrote:
>> mary...
>>
>> this use of cds start issue has been very confusing to us here. maybe you
>> can help with additional details.
>>
>> what does it mean to 'use the cds start as the start codon'... in terms of
>> algorithms, please? do you mean a literal translation from that codon,
>> whether the 1st triplet of the cds is atg or not?
>>
>> what are the consequences of using the cds start as the start codon when
>> the transcription start codon is known and annotated and should be used,
>> instead?
>>
>> mary, fyi we are using our own translator, not any ucsc software. is the
>> ucsc software programmed to anticipate the cds start as the 'start' codon
>> but still return the translation of the annotated transcript or
>> something?
>>
>> jlw
>> director
>> bioinformatics analysis core
>> pitt
>>
>>> Hi Mary,
>>>
>>> Thanks for the answer, but I would like to know why the translation
>>> result for many genes comes out different when the CDS start given in
>>> the UCSC Genome Browser table is taken to be the translation start.
>>> For example, I downloaded the complete chromosome 1 of hg19 from the
>>> UCSC Genome Browser FTP downloads (whole genome) and then carefully
>>> extracted all the exonic regions from the base at CDS start to the base
>>> at CDS end for a gene/transcript (uc010nya.1), with the exon start and
>>> end positions and the CDS start and end obtained from UCSC. Then I
>>> translated those regions, assuming that the reading frame begins at
>>> CDS start (translation start), and the string of amino acids differs
>>> from the protein sequence of uc010nya.1 obtained from the UCSC Genome
>>> Browser.
>>>
>>> The two sequences are:
>>>
>>> > Manual translation from CDS start till CDS end for uc010nya.1
>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQDSAPSRQSPAA
>>> EWRPRRLLQVSTESARPLQLPGYPGLCRCALLQRPAQGRPQVRAAALRGR
>>> GQDRGVYAAAPETGNSWRAQPSXXXXXXXXLCL*LPTPFCS*HSPAHNP*
>>> CLLCVPETFLDLGPPGASSVAPDSARPLPV*TLSPHLLTX
>>>
>>> > uc010nya.1 obtained from UCSC Genome Browser table
>>> MSESRQTHVTLHDIDPQALDQLVQFAYTAEIVVGEGNVQTLLPAASLLQLNGVRDACCKF
>>> LLSQLDPSNCLGIRGFADAHSCSDLLKAAHRYVLQHFVDVAKTEEFMLLPLKQVTAGGPS
>>> PRPPPHPTPVFVFDSRPRFVPDTALPTILSACCVSPRPFWIWAPQEPRLWLLTLLGPSQY
>>> EHSAPTC
>>>
>>> From first line of the first sequence after "GNVQ" you will start
>>> seeing the deviation from the second sequence.
>>>
>>> Please let me know why does it then differ.
>>>
>>> Thank you,
>>> Rahil Sethi
>>>
>>>> Hi Rahil,
>>>>
>>>> Thank you so much for giving the assembly, track and table you were
>>>> using when you encountered your question - it is much appreciated!
>>>>
>>>> UCSC Genes does not have cdsStartStat, cdsEndStat or exonFrames fields
>>>> like most of our gene prediction tracks (more information about why can
>>>> be found in this previous mailing list question:
>>>> https://lists.soe.ucsc.edu/pipermail/genome/2010-September/023585.html).
>>>> This means that you can use the CDS start and CDS end as start and stop
>>>> codons. Please keep in mind that we have made the CDS start equal the
>>>> CDS end for non-coding genes.
>>>>
>>>> I hope this information is helpful. Please feel free to contact the
>>>> mail list again if you require further assistance.
>>>>
>>>> Best,
>>>> Mary
>>>> ------------------
>>>> Mary Goldman
>>>> UCSC Bioinformatics Group
>>>>
>>>> On 10/11/10 7:29 AM, [email protected] wrote:
>>>>> Hello,
>>>>>
>>>>> I am trying to extract the codon start and codon stop for a set of
>>>>> genes in a given position, from Tables in UCSC Genome Browser.
>>>>> Whenever I click output for Genes and Gene Predictions in a
>>>>> chromosome position range, it gives me all the features of genes like
>>>>> exon start, exon stop, CDS start, CDS stop, but does not give me the
>>>>> codon start (start position of the first codon, i.e. translation
>>>>> start) and codon stop (position of stop codon, i.e. translation stop).
>>>>>
>>>>> Please let me know how I can get this information.
>>>>>
>>>>> I am using:
>>>>> Genome: Hg19
>>>>> Group: Genes and Gene Prediction Tracks
>>>>> Track: UCSC Genes
>>>>> Table: KnownGene
>>>>> region: defined regions
>>>>>
>>>>> Thank you,
>>>>> Rahil Sethi
>>>>> _______________________________________________
>>>>> Genome maillist - [email protected]
>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
>> --
>> Thank you very much,
>>
>> James Lyons-Weiler
>>
>> Director, Bioinformatics Analysis Core
>> Genomics and Proteomics Core Laboratories
>> Department of Biomedical Informatics
>> University of Pittsburgh Cancer Institute
>> 3rd Floor
>> 3343 Forbes Avenue
>> Pittsburgh, PA 15260
>> phone: 412-728-8743
>> reply-to: [email protected]
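
The procedure discussed in this thread (clip each exon to the CDS using 0-based starts and 1-based ends, join the pieces, then translate) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the UCSC implementation: the function names `cds_sequence` and `translate` and the toy chromosome string are hypothetical. The minus-strand branch is an assumption about something worth checking (the knownGene table has a strand column, and for minus-strand genes cdsStart is the lower genomic coordinate); nothing in the thread confirms it as the cause of the mismatches.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def cds_sequence(chrom_seq, exon_starts, exon_ends, cds_start, cds_end, strand):
    """Join the coding part of each exon (hypothetical helper).

    Starts are 0-based and ends are 1-based (half-open intervals), so a
    Python slice chrom_seq[start:end] needs no +1/-1 correction.
    """
    parts = []
    for start, end in zip(exon_starts, exon_ends):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:  # skip exons that lie entirely in the UTRs
            parts.append(chrom_seq[lo:hi])
    seq = "".join(parts).upper()
    if strand == "-":
        # Assumption: minus-strand transcripts are read as the reverse
        # complement of the joined genomic sequence.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def translate(seq):
    """Translate complete codons; codons with N or other symbols become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


# Toy example: two exons [0, 8) and [12, 20), CDS [2, 18), plus strand.
toy_chrom = "GGATGGCCTTTTAAATAACC"
print(translate(cds_sequence(toy_chrom, [0, 12], [8, 20], 2, 18, "+")))  # MAK*
```

Because the intervals are half-open, the only coordinate logic needed is the `max`/`min` clipping; comparing the plus- and minus-strand output for a transcript against the database protein is one quick way to see whether strand handling matters for a given mismatch.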
