Hi, I found some strange lines in the GTF output that can be retrieved from the Table Browser.
In a file downloaded at the end of 2010 there were lines with start > end. This was reported on the mailing list about a year ago under the subject "Table Browser may return GTF lines with start > end", and I couldn't find any such lines when downloading the same data today. However, when looking at one of the regions that previously had a CDS with start > end, I noticed that there is now a strange entry for the stop codon.

I looked up the following region in the Table Browser:

Mammal-Human-hg18
Genes and Gene Prediction Tracks - RefSeq Genes
table: refGene
position: chr1:92537110-92626320
output format: GTF

The relevant part is this:

chr1 hg18_refGene CDS 92618869 92619016 0.000000 + 1 gene_id "NM_024813"; transcript_id "NM_024813";
chr1 hg18_refGene exon 92618869 92619018 0.000000 + . gene_id "NM_024813"; transcript_id "NM_024813";
chr1 hg18_refGene stop_codon 92619017 92625156 0.000000 + . gene_id "NM_024813"; transcript_id "NM_024813";
chr1 hg18_refGene exon 92625156 92626320 0.000000 + . gene_id "NM_024813"; transcript_id "NM_024813";

Please note that the stop_codon entry spans 6140 bases (92619017 to 92625156), most of which lie outside the exons of this transcript. This might be because the stop codon is split by the splice site: the first exon shown ends at 92619018, two bases past the end of the CDS, and the next exon starts at 92625156, exactly where the stop_codon entry ends.

The GTF specification allows for spliced stop codons, see http://mblab.wustl.edu/GTF22.html:

"The "start_codon" and "stop_codon" features are not required to be atomic; they may be interrupted by valid splice sites."

I guess that the correct thing would be to emit two stop_codon entries instead of one, but there might be reasons to keep it that way.

Thanks and best regards,
Hilmar

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
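
P.S. To illustrate what two stop_codon entries could look like for this transcript, here is a minimal Python sketch. It is only an assumption about how one might split the codon, not how the Table Browser works internally: `split_stop_codon` is a hypothetical helper, and it assumes a plus-strand transcript with 1-based inclusive GTF coordinates and exons sorted by start.

```python
# Hypothetical helper, not part of any UCSC tool: split a 3-base stop codon
# across exon boundaries (plus strand, 1-based inclusive GTF coordinates).
def split_stop_codon(codon_start, exons, codon_len=3):
    """Return (start, end) segments covering codon_len exonic bases.

    exons: list of (start, end) tuples, sorted by start, non-overlapping.
    codon_start: genomic position of the first stop codon base.
    """
    segments = []
    remaining = codon_len
    for exon_start, exon_end in exons:
        if exon_end < codon_start:
            continue  # exon lies entirely before the codon
        seg_start = max(exon_start, codon_start)
        seg_end = min(exon_end, seg_start + remaining - 1)
        segments.append((seg_start, seg_end))
        remaining -= seg_end - seg_start + 1
        if remaining == 0:
            break
    return segments


# The two exons of NM_024813 quoted in the GTF excerpt.
exons = [(92618869, 92619018), (92625156, 92626320)]
for start, end in split_stop_codon(92619017, exons):
    print(f"stop_codon {start} {end}")
# prints:
# stop_codon 92619017 92619018
# stop_codon 92625156 92625156
```

With the coordinates from the excerpt, this gives one 2-base entry at the end of the first exon and one 1-base entry at the start of the next, instead of a single entry running through the intron.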
