Hi Joel,

One of our developers had this to say:

"The transcript uc001qdm.1 is based on AF410783 as well as on many of the other mRNAs that align to the TIRAP locus. If we dissect the locus from 5' to 3', we start with a tiny 5' exon, followed by a cassette exon with an alternative 5' splice site. That alternative 5' region is only 15 bases long and is easy to miss, but it's an important distinguishing feature: transcripts uc001qdl.1 and uc001qdm.1 and mRNAs AF378129 and AF410783 have the long form of the exon, while transcript uc001qdn.1 and mRNA BC032474 have the short form.

Continuing in the 3' direction, there are three more constitutive exons (with the second one containing the CDS start). The UCSC transcripts and mRNAs either end with a short bleeding exon just at the 3' end of that 3rd constitutive exon, or they have another intron and one or more additional exons. Our friend uc001qdm.1 represents the transcripts that include this intron plus the long form of that cassette exon (the 2nd exon).

Notice that of the two RefSeqs, one contains the intron plus the short form of the cassette exon, and the other contains the long form of the cassette exon plus the bleeding exon. So uc001qdm.1 won't be based on a RefSeq. There is a RefSeq in the refseq column of kgXref, but it's not the best evidence for the UCSC transcript; it's simply the single RefSeq that overlapped the UCSC transcript by the most bases.

When RefSeq / mRNA / EST sequences are assembled into UCSC Genes transcripts, each sequence has a certain weight according to its evidence type (high for RefSeq, low for mRNA, very low for EST). The exons and introns that you see in UCSC Genes transcripts are the ones supported by a total weight equivalent to at least two mRNAs, or one mRNA and two ESTs. You don't see the interesting 3' exons of AF410783 in any UCSC Genes transcript because they're not supported by any other sequence. If there were another sequence with these exons, you'd see a UCSC Genes transcript that looked more like AF410783.
Instead, the transcript has a single large 3' UTR exon, which is supported by a sufficient number of sequences: namely, the shorter mRNAs that start in the 4th exon of AF410783. So the code essentially decided that there's not enough evidence to support the 3' exons of AF410783, but there is sufficient evidence for a single 3' UTR exon.

In the kgXref table, the mRNA column contains the mRNA or EST sequence that best represents the UCSC Genes transcript. Here, "best" means the greatest number of overlapping bases (and starting in the next version of UCSC Genes, the "best" mRNA will also have a consistent splicing pattern, if such an mRNA exists). This is why AF410783 is the best representative of uc001qdm.1, even though they look quite different."

For your other question, the details page should be able to answer how we decide which GenBank sequences to include: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene.

I hope this information is helpful. Please feel free to contact the mail list again if you require further assistance.

Best,
Mary

------------------
Mary Goldman
UCSC Bioinformatics Group

On 6/14/11 11:29 AM, Parker, Joel wrote:
> I am having trouble relating the content of the knownGene table to what
> is shown for GenBank. This is all based on the hg19 build. As an
> example, searching for the genbank ID AF410783 maps to one form of TIRAP
> in knownGene. The most 3' region of this gene is displayed in the
> browser around 126,165,000, but the form with the same ID in the Human
> mRNAs section has a region that extends out to >125,167,000. What is
> the cause of this apparent discrepancy? Also, what is the determining
> factor behind inclusion of a GenBank sequence in knownGene?
>
> Thanks,
>
> Joel
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
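
The support rule described in the reply above (each sequence contributes a weight by evidence type, and an exon or intron is kept only when the total reaches the equivalent of two mRNAs, or one mRNA and two ESTs) can be sketched roughly as follows. This is an illustration only, not the UCSC Genes build code: the numeric weights, the threshold value, and the function names are assumptions chosen so that the stated equivalences hold. The RefSeq weight in particular is a guess, since the reply only says it is "high".

```python
# Hypothetical weights. Only the ordering (RefSeq > mRNA > EST) and the
# equivalence "two mRNAs == one mRNA plus two ESTs" come from the reply.
WEIGHTS = {"refseq": 2.0, "mrna": 1.0, "est": 0.5}
THRESHOLD = 2.0  # assumed: the weight of two mRNAs


def total_weight(evidence_types):
    """Sum the weights of the sequences supporting one exon or intron."""
    return sum(WEIGHTS[kind] for kind in evidence_types)


def is_supported(evidence_types):
    """Return True when the combined evidence reaches the threshold."""
    return total_weight(evidence_types) >= THRESHOLD


# A feature seen in a single mRNA only, like the 3' exons of AF410783.
print(is_supported(["mrna"]))                # False
# Two mRNAs, or one mRNA plus two ESTs, are enough.
print(is_supported(["mrna", "mrna"]))        # True
print(is_supported(["mrna", "est", "est"]))  # True
```

Under these assumed values, a feature seen in one mRNA alone falls below the threshold, which matches the explanation of why the 3' exons of AF410783 do not appear in any UCSC Genes transcript.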

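
The current meaning of "best" in the kgXref mRNA column (the sequence with the greatest number of overlapping bases) can also be shown with a small sketch. The coordinates and the names `mrna_A` and `mrna_B` are made-up toy values, not real TIRAP alignments, and the functions are hypothetical helpers rather than anything from the UCSC code base. Exons are half-open (start, end) pairs, and exons within one sequence are assumed not to overlap each other.

```python
def overlap_bases(exons_a, exons_b):
    """Count bases shared by two lists of half-open (start, end) exons."""
    shared = 0
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            shared += max(0, min(a_end, b_end) - max(a_start, b_start))
    return shared


def best_representative(transcript_exons, candidates):
    """Return the candidate name with the most bases overlapping the transcript."""
    return max(
        candidates,
        key=lambda name: overlap_bases(transcript_exons, candidates[name]),
    )


# Toy transcript with a single large final exon.
transcript = [(100, 200), (300, 400), (500, 900)]
candidates = {
    # Matches the first exons but splits the last region into two exons.
    "mrna_A": [(100, 200), (300, 400), (500, 600), (700, 750)],
    # Short sequence covering only part of the final exon.
    "mrna_B": [(600, 900)],
}

print(best_representative(transcript, candidates))  # mrna_A
```

Here `mrna_A` wins on overlapping bases even though its splicing differs from the transcript in the final region, which is the same kind of situation as AF410783 and uc001qdm.1.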