Hi Rachel and Katrina,

Thanks for looking into this and for the extensive explanations. That's very useful! I have one comment/suggestion below.

On 08/16/2010 09:46 AM, Rachel Harte wrote:
> Hello Hervé,
>
> I have some more information to add to the reply below after reviewing
> all the CCDS in your list below. I can confirm that all the CCDS in your
> list, that have in-frame stop codons, are for genes encoding
> selenoproteins so the TGA stop codon in these transcripts is translated
> as the amino acid, selenocysteine. At the moment, the UCSC Genome
> Browser does not recognise these TGA codons as selenocysteine codons so
> they are coloured red as if they are stop codons when you zoom in to the
> base level on the Genome Browser. The CCDS with non-ATG start codons all
> have alternative translation start codons such as CTG or GTG and there
> is experimental evidence that suggests that these alternate start codons
> are the predominant ones used for these genes. I am going to update the
> CCDS track description page on the UCSC Genome Browser to explain these
> exceptions to the criteria listed there.

Thanks for this. What about the protein sequence provided by the Browser?
If I understand correctly, it should not be truncated at the first
in-frame stop codon because this codon actually gets translated.

> Thank you for drawing our
> attention to this omission.
>
> Finally, the three CCDS that have nucleotide lengths that are not divisible
> by three, are pending withdrawal from the CCDS set. This is because
> these CCDS are from genes that are known to be polymorphic and the
> reference genome allele contains a 1 nt insert and cannot encode the
> protein as the 1 nt insert causes a frameshift in translation which can
> cause the protein to be truncated and contain erroneous sequence so that
> the protein is not likely to be functional. Since CCDS is an annotation
> of the reference genome, we can not create a CCDS on the reference
> genome that encodes the normal protein for these genes.

I see. Thanks for the clarification.

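To make the question above about the protein sequence concrete, here is a minimal Python sketch of the checks discussed in this thread (initiating ATG, terminal stop codon, in-frame stop codons, CDS length a multiple of 3), plus a translation that does not truncate at in-frame stops. The names check_cds and translate and the toy sequences are mine, for illustration only; they are not the tools I used and not real CCDS sequences. Reading every in-frame TGA as selenocysteine ('U') is a simplification that only makes sense for a CDS already known to encode a selenoprotein.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(cds):
    """Split a CDS into complete codons, ignoring any trailing 1 or 2 bases."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def check_cds(cds):
    """Hypothetical helper: report the criteria discussed in this thread."""
    codons = split_codons(cds)
    in_frame = len(cds) % 3 == 0
    ends_with_stop = in_frame and bool(codons) and CODON_TABLE.get(codons[-1]) == "*"
    # Only exclude the last codon from the in-frame scan if it is the real stop.
    body = codons[:-1] if ends_with_stop else codons
    return {
        "length_multiple_of_3": in_frame,
        "starts_with_ATG": cds.upper().startswith("ATG"),
        "ends_with_stop": ends_with_stop,
        # 0-based codon indices of stop codons before the terminal one
        "inframe_stops": [i for i, c in enumerate(body) if CODON_TABLE.get(c) == "*"],
    }


def translate(cds, sec_as_u=False):
    """Translate codon by codon without truncating at in-frame stops.

    The terminal stop codon, if present, is dropped. With sec_as_u=True an
    in-frame TGA is written as 'U' (selenocysteine) instead of '*'.
    Unknown codons (e.g. containing N) are written as 'X'.
    """
    codons = split_codons(cds)
    if codons and CODON_TABLE.get(codons[-1]) == "*":
        codons = codons[:-1]
    protein = []
    for codon in codons:
        aa = CODON_TABLE.get(codon, "X")
        if sec_as_u and codon == "TGA":
            aa = "U"
        protein.append(aa)
    return "".join(protein)


# Toy CDS (made up): ATG GCT TGA GGA TAA, with an in-frame TGA at codon index 2.
toy = "ATGGCTTGAGGATAA"
print(check_cds(toy)["inframe_stops"])  # [2]
print(translate(toy))                   # MA*G
print(translate(toy, sec_as_u=True))    # MAUG
```

With something like this, a selenoprotein CDS would be flagged by inframe_stops but still translated over its full length, up to the stop codon at the end of the CDS.
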
>
> CCDS is constantly being reviewed for such cases and for new evidence
> that requires the CCDS to be updated. NCBI releases a new CCDS set
> periodically and then these updates come into effect. If you do see any
> other potential problems with CCDS, then please notify the CCDS group at
> [email protected].

Thank you. I will.

Thanks again for the clarifications!

Cheers,
H.

>
> Rachel
>
> On 8/13/10 3:00 PM, Katrina Learned wrote:
>> Hi Hervé,
>>
>> Thank you for your email. One of our staff members is also part of
>> CCDS project and she has offered the following information:
>>
>> CCDS43034.1 is actually a selenoprotein (SELO, selenoprotein O) and so
>> it has an in-frame stop codon because, in this protein, the in-frame
>> stop codon is translated to a selenocysteine. We are currently
>> determining if this is the case for the other CCDS you found with
>> in-frame stop codons.
>>
>> As for the CCDS without start codons, there are some CCDS that have
>> been annotated with a non-ATG start codon e.g. CTG where there is
>> experimental evidence to suggest that the protein is translated from
>> the non-ATG start codon.
>>
>> Finally, CCDS is constantly being updated, and so the project members
>> are continually reviewing CCDS and correcting any errors or updating
>> annotations based on additional evidence that becomes available. These
>> updates are released periodically.
>>
>> We are currently looking into your additional observations in more
>> detail. Please don't hesitate to contact the mail list again if you
>> have any further questions.
>>
>> Katrina Learned
>> UCSC Genome Bioinformatics Group
>>
>> Hervé Pagès wrote, On 08/13/10 12:50:
>>> Hi,
>>>
>>> According to the Methods section of the CCDS track page for hg18,
>>> one of the criteria used to assess each gene is:
>>>
>>>   - an initiating ATG, a valid stop codon, and no in-frame stop codons
>>>
>>> However when using some tools to extract and translate the transcripts
>>> for all the genes in the track, I find that some of the genes fail to
>>> satisfy the criteria. More precisely:
>>>
>>>   - 21 genes fail to have an initiating ATG (e.g. CCDS43136.1,
>>>     CCDS34059.1, etc..., see full listing at the end of the email).
>>>
>>>   - 15 genes fail to have no in-frame stop codons. E.g. the
>>>     CCDS43034.1 gene (on chr22 strand +) has an in-frame stop
>>>     codon 9 base upstream the stop codon located at the position
>>>     specified in the cdsEnd column of the ccdsGene table for
>>>     that gene.
>>>
>>> When using the Genome Browser to display CCDS43136.1 and CCDS43034.1
>>> for hg18, I can *see* a confirmation of the problem. But if I click on
>>> the CCDS43034.1 gene and then follow the link to the protein sequence
>>> then the sequence is truncated at the in-frame stop codon, not at the
>>> stop codon located at ccdsGene.cdsEnd. So I'm wondering why isn't
>>> ccdsGene.cdsEnd set to the end of the effective stop codon?
>>>
>>> For hg19, the situation is slightly worse. In addition to having genes
>>> with the same problems as reported above, 3 genes have a cumulated
>>> CDS length that is not even a multiple of 3 (CCDS47664.1, CCDS47663.1
>>> and CCDS45377.1).
>>>
>>> I would be very thankful if someone could provide some insight about
>>> this.
>>>
>>> Thanks,
>>> H.
>>>
>>> Full listing of failing genes for hg18:
>>>   - without an initiating ATG:
>>>       CCDS43136.1, CCDS34059.1, CCDS43376.1, CCDS34458.1, CCDS34457.1,
>>>       CCDS34737.1, CCDS6359.2, CCDS35004.1, CCDS35044.1, CCDS7878.2,
>>>       CCDS7877.2, CCDS41618.1, CCDS31428.1, CCDS31730.1, CCDS31729.1,
>>>       CCDS42102.1, CCDS32514.1, CCDS33104.1, CCDS33460.1, CCDS33646.1,
>>>       CCDS33647.1
>>>   - with one or more in-frame stop codons:
>>>       CCDS41340.1, CCDS41339.1, CCDS41283.1, CCDS41282.1, CCDS43091.1,
>>>       CCDS43389.1, CCDS43432.1, CCDS41964.1, CCDS41992.1, CCDS42100.1,
>>>       CCDS42150.1, CCDS42457.1, CCDS42981.1, CCDS43003.1, CCDS43034.1
>>>
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone: (206) 667-5791
Fax: (206) 667-1319
