[pygr] More on Ensembl data in UCSC

Marek Szuba Mon, 14 Dec 2009 18:28:00 -0800

Hello everyone,

Now that I've got a working prototype for this (which you can find in
my 'ucsc_ensembl' GitHub branch), Chris believes it is time to discuss
the final form - or, in other words, the schema. Here is some
information about the data in question and what needs to be done to
obtain different types of annotations.



        The Data

1. All the Ensembl data in UCSC is *transcript* data. You can get
transcript, gene, protein and/or exon annotations from the database but
in the end it all grinds down to appropriately manipulating transcript
data;

2. UCSC stores stable Ensembl identifiers of transcripts, genes and
proteins - but not exons. The latter can only be referred to by
combining the relevant stable Ensembl transcript ID with the exon's
offset (or, in Ensembl terminology, rank) within that transcript.

3. All Ensembl data in the UCSC database is stored in the following
three tables:
 - ensGene - contains transcript information, including blobs
containing exon data. Each row is identified by either stable Ensembl
transcript ID (uniquely it seems, even though it's not guaranteed by
the database) or stable Ensembl gene ID (not always uniquely);
 - ensPep - contains sequence data. Each row is identified by stable
Ensembl transcript ID (as primary key);
 - ensGtp - contains mappings between stable Ensembl transcript, gene
and protein IDs. Transcript IDs are unique, the other two - not
necessarily;

4. We are interested in the following tables in the native Ensembl
database:
 - exon_stable_id - maps between stable and internal Ensembl exon IDs;
 - exon_transcript - maps internal Ensembl exon IDs to "internal
Ensembl transcript ID + rank" pairs;
 - transcript_stable_id - maps between stable and internal Ensembl
transcript IDs;


        The Schema

1. Transcript annotations: this one seems easy, an SQLTable connected
to ensGene and using the transcript-ID column as the primary key;

2. Gene annotation: at the first glance, as above but using the gene-ID
column as the primary key. One issue to consider here is alternative
splicing - with this implementation it's not possible to query genes
with multiple transcripts;

3. Protein annotation: get the SQLTable used for transcript
annotations, then connect it with MapView to another one attached to
ensGtp, with the protein-ID column as the primary key. For each protein
ID specified by the user use the MapView to obtain the corresponding
transcript ID, then return appropriate transcript information. One
question: shall we hide from the user that protein data (s)he gets is
actually transcript data (i.e come up with some sort of a wrapper) or
just let him/her have it as it is?

4. Exon annotations: this is where things get a bit more complicated:
 a. First, get the stable Ensembl transcript ID corresponding to the
provided stable Ensembl exon ID. This requires a three-table join but
since it's still just a one-to-one mapping, MapView could be used (e.g.
from Ensembl's exon_stable_id to UCSC's ensGene);
 b. Next, extract exon blobs from the relevant row of ensGene and parse
them so that information pertaining to individual exons can be accessed;
 c. Afterwards, query the Ensembl database again but in the opposite
direction - get all stable Ensembl exon IDs for the relevant
transcript, preserving rank information somehow (either explicitly or
via appropriate ordering). This would most likely require a GraphView
from ensGene to exon_stable_id;
 d. Finally, assemble exon tuples from data from both databases and
put them to a local dictionary. You can now use the original stable
Ensembl exon ID provided by the user to query that local dictionary,
moreover it's trivial to implement caching of already-assembled exons.


That's it. Please let me know what you think.

Cheers,
-- 
MS

--

You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-...@googlegroups.com.
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en.

[pygr] More on Ensembl data in UCSC

Reply via email to