I think AF410783 is in knownGene when you download it, but it's missing
its 3' exons.  So it's not a case of the transcript not being included;
it's a case of the knownGene table missing exon data.
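
One rough way to check this for a given transcript is to compare the exon
list in the knownGene row against the alignment blocks of the mRNA. The
sketch below assumes the comma-separated, trailing-comma list format of the
exonStarts/exonEnds columns (genePred) and the tStarts/blockSizes columns
(PSL). The function names and the coordinates are made up for illustration;
they are not the real TIRAP values, and PSL blocks only approximate exons
because small alignment gaps also split blocks.

```python
def parse_list(field):
    """Parse a UCSC-style comma-separated list such as '100,300,500,'."""
    return [int(x) for x in field.strip(",").split(",") if x]


def genepred_exons(exon_starts, exon_ends):
    """Return (start, end) pairs from exonStarts / exonEnds strings."""
    return list(zip(parse_list(exon_starts), parse_list(exon_ends)))


def psl_blocks(t_starts, block_sizes):
    """Return (start, end) pairs from PSL tStarts / blockSizes strings."""
    starts = parse_list(t_starts)
    sizes = parse_list(block_sizes)
    return [(s, s + n) for s, n in zip(starts, sizes)]


def unsupported_blocks(mrna_blocks, transcript_exons):
    """mRNA blocks that overlap no exon of the transcript (half-open coords)."""
    return [
        b for b in mrna_blocks
        if not any(b[0] < e[1] and e[0] < b[1] for e in transcript_exons)
    ]


# Hypothetical coordinates, chosen only to show the shape of the problem:
# the mRNA has two 3' blocks that the transcript does not cover.
transcript = genepred_exons("100,300,500,", "200,400,900,")
mrna = psl_blocks("100,300,500,1000,1200,", "100,100,100,100,100,")
missing = unsupported_blocks(mrna, transcript)
print(missing)
```

Any block that comes back from unsupported_blocks is part of the mRNA
alignment with no counterpart in the knownGene transcript.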

There are a lot of these, IMO.
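
For reference, here is how I read the support rule described in the quoted
message below, as a minimal sketch. The message only gives relative weights
(high for refseq, low for mRNA, very low for EST) and a threshold equivalent
to two mRNAs or one mRNA plus two ESTs. The numbers here are my own
placeholders that satisfy that relationship, not the values the UCSC Genes
pipeline actually uses, and the refseq weight in particular is a guess.

```python
# Placeholder weights (assumption): an EST counts as half an mRNA, which is
# what "two mRNAs or one mRNA and two ESTs" implies. The refseq value is a
# guess; the message only says it is "high".
WEIGHTS = {"refseq": 4, "mrna": 2, "est": 1}

# Threshold: total weight equivalent to at least two mRNAs.
THRESHOLD = 2 * WEIGHTS["mrna"]


def is_supported(evidence_types):
    """True if the sequences supporting an exon or intron reach the threshold."""
    return sum(WEIGHTS[t] for t in evidence_types) >= THRESHOLD


# The 3' exons of AF410783 are backed by that one mRNA only.
print(is_supported(["mrna"]))                # False
# Two mRNAs, or one mRNA plus two ESTs, are enough.
print(is_supported(["mrna", "mrna"]))        # True
print(is_supported(["mrna", "est", "est"]))  # True
```

Under this reading, a single additional mRNA with the same 3' exons would be
enough to get them into a UCSC Genes transcript.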

On Fri, Jun 17, 2011 at 12:56 PM, Mary Goldman <[email protected]> wrote:

> Hi Joel,
>
> One of our developers had this to say:
>
> "The transcript uc001qdm.1 is based on AF410783, as well as many of the
> other mRNAs that align to the TIRAP locus. If we dissect the locus
> from 5' to 3', we start with a tiny 5' exon, and then next is a
> cassette exon with an alt 5' splice site. That alt 5' region is only
> 15 bases long and is easy to miss, but it's an important
> distinguishing feature: transcripts uc001qdl.1 and uc001qdm.1 and
> mRNAs AF378129 and AF410783 have the long form of the exon while
> transcript uc001qdn.1 and mRNA BC032474 have the short form.
> Continuing in the 3' direction, there are three more constitutive
> exons (with the second one containing the CDS start). The UCSC
> transcripts and mRNAs end either with a short bleeding exon just at
> the 3' end of that 3rd constitutive exon, or they have another intron
> and one or more additional exons. Our friend uc001qdm.1 represents
> the transcripts that include this intron plus the long form of that
> cassette exon (the 2nd exon). Notice that of the two refseqs, there's
> one that contains the intron plus the short form of the cassette exon,
> and one that contains the long form of the cassette exon plus the
> bleeding exon. So uc001qdm.1 won't be based on a refseq. There is a
> refseq in the refseq column of kgXref, but it's not the best evidence
> for the UCSC transcript, it's the single refseq that overlapped the
> UCSC transcript by the most bases.
>
> When refseq / mRNA / EST sequences are assembled into UCSC Genes
> transcripts, each sequence has a certain weight according to the
> evidence type (high for refseq, low for mRNA, very low for EST). The
> exons and introns that you see in UCSC Genes transcripts are the ones
> that are supported by a total amount of weight equivalent to at least
> two mRNAs or one mRNA and two ESTs. You don't see the interesting 3'
> exons of AF410783 in any UCSC Genes transcript because they're not
> supported by any other sequence. If there were another sequence with
> these exons, you'd see a UCSC Genes transcript that looked more like
> AF410783. Instead, the transcript has a single large 3' UTR exon,
> which is supported by a sufficient number of sequences: namely, the
> shorter mRNAs that start in the 4th exon of AF410783. So the code
> essentially decided that there's not enough evidence to support the 3'
> exons of AF410783, but there is sufficient evidence for a single 3'
> UTR exon.
>
> In the kgXref table, the mRNA column contains the mRNA or EST sequence
> that best represents the UCSC Genes transcript. Here, "best" implies
> the greatest number of overlapping bases (and starting in the next
> version of UCSC Genes, the "best" mRNA will also have a consistent
> splicing pattern, if such an mRNA exists). This is why AF410783 is
> the best representative of uc001qdm.1, even if they look quite
> different."
>
> For your other question, the details page should be able to answer how
> we decide which GenBank sequences to include:
> http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene.
>
> I hope this information is helpful.  Please feel free to contact the
> mail list again if you require further assistance.
>
> Best,
> Mary
> ------------------
> Mary Goldman
> UCSC Bioinformatics Group
>
>
>
> On 6/14/11 11:29 AM, Parker, Joel wrote:
> > I am having trouble relating the content of the knownGene table to what
> > is shown for GenBank.  This is all based on the hg19 build. As an
> > example, searching for the genbank ID AF410783 maps to one form of TIRAP
> > in knownGene.  The most 3' region of this gene is displayed in the
> > browser around 126,165,000, but the form with the same ID in the Human
> > mRNAs section has a region that extends out to >125,167,000.  What is
> > the cause of this apparent discrepancy?  Also, what is the determining
> > factor behind inclusion of a GenBank sequence in knownGene?
> >
> >
> >
> > Thanks,
> >
> > Joel
> >
> >
> >
> >
> >
> > _______________________________________________
> > Genome maillist  -  [email protected]
> > https://lists.soe.ucsc.edu/mailman/listinfo/genome