Re: [Genome] help relating knownGene table to GenBank mRNA

Mary Goldman Fri, 17 Jun 2011 09:58:44 -0700

Hi Joel,

One of our developers had this to say:

"The transcript uc001qdm.1 is based on AF410783, as well as many of the
other mRNAs that align to the TIRAP locus. If we dissect the locus
from 5' to 3', we start with a tiny 5' exon, and then next is a
cassette exon with an alt 5' splice site. That alt 5' region is only
15 bases long and is easy to miss, but it's an important
distinguishing feature: transcripts uc001qdl.1 and uc001qdm.1 and
mRNAs AF378129 and AF410783 have the long form of the exon while
transcript uc001qdn.1 and mRNA BC032474 have the short form.
Continuing in the 3' direction, there are three more constitutive
exons (with the second one containing the CDS start). The UCSC
transcripts and mRNAs end either with a short bleeding exon just at
the 3' end of that 3rd constitutive exon, or they have another intron
and one or more additional exons. Our friend uc001qdm.1 represents
the transcripts that include this intron plus the long form of that
cassette exon (the 2nd exon). Notice that of the two refseqs, there's
one that contains the intron plus the short form of the cassette exon,
and one that contains the long form of the cassette exon plus the
bleeding exon. So uc001qdm.1 won't be based on a refseq. There is a
refseq in the refseq column of kgXref, but it's not the best evidence
for the UCSC transcript; it's simply the single refseq that overlaps
the UCSC transcript by the most bases.

When refseq / mRNA / EST sequences are assembled into UCSC Genes
transcripts, each sequence has a certain weight according to the
evidence type (high for refseq, low for mRNA, very low for EST). The
exons and introns that you see in UCSC Genes transcripts are the ones
that are supported by a total amount of weight equivalent to at least
two mRNAs or one mRNA and two ESTs. You don't see the interesting 3'
exons of AF410783 in any UCSC Genes transcript because they're not
supported by any other sequence. If there were another sequence with
these exons, you'd see a UCSC Genes transcript that looked more like
AF410783. Instead, the transcript has a single large 3' UTR exon,
which is supported by a sufficient number of sequences: namely, the
shorter mRNAs that start in the 4th exon of AF410783. So the code
essentially decided that there's not enough evidence to support the 3'
exons of AF410783, but there is sufficient evidence for a single 3'
UTR exon.

In the kgXref table, the mRNA column contains the mRNA or EST sequence
that best represents the UCSC Genes transcript. Here, "best" implies
the greatest number of overlapping bases (and starting in the next
version of UCSC Genes, the "best" mRNA will also have a consistent
splicing pattern, if such an mRNA exists). This is why AF410783 is
the best representative of uc001qdm.1, even if they look quite
different."
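
As a rough illustration of the two rules described in the quote above
(evidence weighting, and choosing the kgXref "best" mRNA by overlapping
bases), here is a minimal Python sketch. It is not the actual UCSC Genes
code. The numeric weights, the function names and the toy coordinates
are all assumptions, chosen only so that the threshold equals two mRNAs
or one mRNA plus two ESTs, as stated above.

```python
# Hypothetical weights: the explanation only says refseq is high, mRNA is
# low and EST is very low. These numbers are assumptions for illustration.
WEIGHTS = {"refseq": 2.0, "mrna": 1.0, "est": 0.5}

# Threshold stated in the explanation: the equivalent of two mRNAs
# (which, with the weights above, also equals one mRNA plus two ESTs).
THRESHOLD = 2 * WEIGHTS["mrna"]


def is_supported(evidence_types):
    """Return True if the summed weight of the evidence reaches the threshold.

    evidence_types lists the type of each sequence that supports a given
    exon or intron, e.g. ["mrna", "est", "est"].
    """
    return sum(WEIGHTS[t] for t in evidence_types) >= THRESHOLD


def overlap_bases(exons_a, exons_b):
    """Count bases shared by two exon lists of half-open (start, end) pairs.

    Assumes the exons within each list do not overlap one another.
    """
    total = 0
    for start_a, end_a in exons_a:
        for start_b, end_b in exons_b:
            total += max(0, min(end_a, end_b) - max(start_a, start_b))
    return total


def best_representative(transcript_exons, candidates):
    """Pick the candidate sequence name that overlaps the transcript most.

    candidates maps a sequence name to its list of aligned exon intervals.
    Splicing consistency is deliberately ignored, as in the current rule.
    """
    return max(
        candidates,
        key=lambda name: overlap_bases(transcript_exons, candidates[name]),
    )


# Toy data (made-up coordinates, not the real TIRAP alignments).
transcript = [(0, 100), (200, 500)]
candidates = {
    "mrna_a": [(0, 100), (200, 300), (350, 400)],  # different 3' splicing
    "mrna_b": [(250, 400)],                        # short, single block
}
print(is_supported(["mrna"]))                       # a lone mRNA
print(is_supported(["mrna", "est", "est"]))         # one mRNA plus two ESTs
print(best_representative(transcript, candidates))
```

In the toy data, mrna_a is chosen even though its 3' exons are spliced
differently from the transcript, because only the number of overlapping
bases is compared. That is the same situation as AF410783 and
uc001qdm.1.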

As for your other question, the track details page explains how we
decide which GenBank sequences to include:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene

I hope this information is helpful.  Please feel free to contact the 
mail list again if you require further assistance.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group

On 6/14/11 11:29 AM, Parker, Joel wrote:
> I am having trouble relating the content of the knownGene table to what
> is shown for GenBank.  This is all based on the hg19 build. As an
> example, searching for the genbank ID AF410783 maps to one form of TIRAP
> in knownGene.  The most 3' region of this gene is displayed in the
> browser around 126,165,000, but the form with the same ID in the Human
> mRNAs section has a region that extends out to >125,167,000.  What is
> the cause of this apparent discrepancy?  Also, what is the determining
> factor behind inclusion of a GenBank sequence in knownGene?
>
>
>
> Thanks,
>
> Joel
>
>
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome