I think AF410783 is in knownGene when you download it, but it's missing its 3' exon. So it's not a case of the transcript not being included; it's a case of the knownGene table missing exon data.
There are a lot of these, IMO.

On Fri, Jun 17, 2011 at 12:56 PM, Mary Goldman <[email protected]> wrote:
> Hi Joel,
>
> One of our developers had this to say:
>
> "The transcript uc001qdm.1 is based on AF410783, as well as many of the
> other mRNAs that align to the TIRAP locus. If we dissect the locus
> from 5' to 3', we start with a tiny 5' exon, and then next is a
> cassette exon with an alt 5' splice site. That alt 5' region is only
> 15 bases long and is easy to miss, but it's an important
> distinguishing feature: transcripts uc001qdl.1 and uc001qdm.1 and
> mRNAs AF378129 and AF410783 have the long form of the exon while
> transcript uc001qdn.1 and mRNA BC032474 have the short form.
> Continuing in the 3' direction, there are three more constitutive
> exons (with the second one containing the CDS start). The UCSC
> transcripts and mRNAs end either with a short bleeding exon just at
> the 3' end of that 3rd constitutive exon, or they have another intron
> and one or more additional exons. Our friend uc001qdm.1 represents
> the transcripts that include this intron plus the long form of that
> cassette exon (the 2nd exon). Notice that of the two refseqs, there's
> one that contains the intron plus the short form of the cassette exon,
> and one that contains the long form of the cassette exon plus the
> bleeding exon. So uc001qdm.1 won't be based on a refseq. There is a
> refseq in the refseq column of kgXref, but it's not the best evidence
> for the UCSC transcript, it's the single refseq that overlapped the
> UCSC transcript by the most bases.
>
> When refseq / mRNA / EST sequences are assembled into UCSC Genes
> transcripts, each sequence has a certain weight according to the
> evidence type (high for refseq, low for mRNA, very low for EST). The
> exons and introns that you see in UCSC Genes transcripts are the ones
> that are supported by a total amount of weight equivalent to at least
> two mRNAs or one mRNA and two ESTs.
> You don't see the interesting 3'
> exons of AF410783 in any UCSC Genes transcript because they're not
> supported by any other sequence. If there was another sequence with
> these exons, you'd see a UCSC Genes transcript that looked more like
> AF410783. Instead, the transcript has a single large 3' UTR exon,
> which is supported by a sufficient number of sequences: namely, the
> shorter mRNAs that start in the 4th exon of AF410783. So the code
> essentially decided that there's not enough evidence to support the 3'
> exons of AF410783, but there is sufficient evidence for a single 3'
> UTR exon.
>
> In the kgXref table, the mRNA column contains the mRNA or EST sequence
> that best represents the UCSC Genes transcript. Here, "best" implies
> the greatest number of overlapping bases (and starting in the next
> version of UCSC Genes, the "best" mRNA will also have a consistent
> splicing pattern, if such an mRNA exists). This is why AF410783 is
> the best representative of uc001qdm.1, even if they look quite
> different."
>
> For your other question, the details page should be able to answer how
> we decide which GenBank sequences to include:
> http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene.
>
> I hope this information is helpful. Please feel free to contact the
> mail list again if you require further assistance.
>
> Best,
> Mary
> ------------------
> Mary Goldman
> UCSC Bioinformatics Group
>
> On 6/14/11 11:29 AM, Parker, Joel wrote:
> > I am having trouble relating the content of the knownGene table to what
> > is shown for GenBank. This is all based on the hg19 build. As an
> > example, searching for the genbank ID AF410783 maps to one form of TIRAP
> > in knownGene. The most 3' region of this gene is displayed in the
> > browser around 126,165,000, but the form with the same ID in the Human
> > mRNAs section has a region that extends out to >125,167,000. What is
> > the cause of this apparent discrepancy?
> > Also, what is the determining factor behind inclusion of a GenBank
> > sequence in knownGene?
> >
> > Thanks,
> >
> > Joel
> >
> > _______________________________________________
> > Genome maillist - [email protected]
> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
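
For anyone trying to reason about the two rules in the developer's explanation quoted in this thread (the evidence-weight threshold for exons and introns, and the "greatest number of overlapping bases" rule for the kgXref mRNA column), here is a rough Python sketch. It is not the UCSC Genes build code. The numeric weights are assumptions: the thread only says "high for refseq, low for mRNA, very low for EST", so the values below were picked so that one mRNA plus two ESTs equals two mRNAs. The function names and the block coordinates in the usage part are hypothetical toy values, not real hg19 positions.

```python
# Assumed weights (not from UCSC): chosen so that 1 mRNA + 2 ESTs == 2 mRNAs.
WEIGHTS = {"refseq": 2.0, "mrna": 1.0, "est": 0.5}

# Threshold from the thread: weight equivalent to at least two mRNAs.
THRESHOLD = 2 * WEIGHTS["mrna"]


def is_supported(evidence_types):
    """Return True if the summed weight of the supporting sequences
    reaches the two-mRNA-equivalent threshold."""
    return sum(WEIGHTS[kind] for kind in evidence_types) >= THRESHOLD


def overlap_bases(blocks_a, blocks_b):
    """Count bases shared by two lists of half-open (start, end) blocks.
    Assumes the blocks within each list do not overlap one another."""
    total = 0
    for a_start, a_end in blocks_a:
        for b_start, b_end in blocks_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def best_representative(transcript_blocks, candidates):
    """Pick the candidate name whose blocks overlap the transcript by the
    most bases, ignoring whether the splicing pattern matches."""
    return max(
        candidates,
        key=lambda name: overlap_bases(transcript_blocks, candidates[name]),
    )


if __name__ == "__main__":
    # A 3' exon seen in only one mRNA does not reach the threshold.
    print(is_supported(["mrna"]))
    print(is_supported(["mrna", "est", "est"]))

    # Toy transcript with one large 3' block, and two toy mRNAs.
    transcript = [(100, 200), (300, 400), (500, 900)]
    candidates = {
        "spliced_mrna": [(100, 200), (300, 400), (500, 600), (700, 750), (800, 850)],
        "short_mrna": [(520, 900)],
    }
    print(best_representative(transcript, candidates))
```

In the toy data the spliced mRNA still wins on overlapping bases even though its 3' end is split into several exons, which is the same situation as AF410783 being listed as the best representative of uc001qdm.1 while looking quite different from it.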
