Hello everyone, Now that I've got a working prototype for this (which you can find in my 'ucsc_ensembl' GitHub branch), Chris believes it is time to discuss the final form - or, in other words, the schema. Here is some information about the data in question and what needs to be done to obtain different types of annotations.
The Data 1. All the Ensembl data in UCSC is *transcript* data. You can get transcript, gene, protein and/or exon annotations from the database but in the end it all grinds down to appropriately manipulating transcript data; 2. UCSC stores stable Ensembl identifiers of transcripts, genes and proteins - but not exons. The latter can only be referred to by combining the relevant stable Ensembl transcript ID with the exon's offset (or, in Ensembl terminology, rank) within that transcript. 3. All Ensembl data in the UCSC database is stored in the following three tables: - ensGene - contains transcript information, including blobs containing exon data. Each row is identified by either stable Ensembl transcript ID (uniquely it seems, even though it's not guaranteed by the database) or stable Ensembl gene ID (not always uniquely); - ensPep - contains sequence data. Each row is identified by stable Ensembl transcript ID (as primary key); - ensGtp - contains mappings between stable Ensembl transcript, gene and protein IDs. Transcript IDs are unique, the other two - not necessarily; 4. We are interested in the following tables in the native Ensembl database: - exon_stable_id - maps between stable and internal Ensembl exon IDs; - exon_transcript - maps internal Ensembl exon IDs to "internal Ensembl transcript ID + rank" pairs; - transcript_stable_id - maps between stable and internal Ensembl transcript IDs; The Schema 1. Transcript annotations: this one seems easy, an SQLTable connected to ensGene and using the transcript-ID column as the primary key; 2. Gene annotation: at the first glance, as above but using the gene-ID column as the primary key. One issue to consider here is alternative splicing - with this implementation it's not possible to query genes with multiple transcripts; 3. Protein annotation: get the SQLTable used for transcript annotations, then connect it with MapView to another one attached to ensGtp, with the protein-ID column as the primary key. For each protein ID specified by the user use the MapView to obtain the corresponding transcript ID, then return appropriate transcript information. One question: shall we hide from the user that protein data (s)he gets is actually transcript data (i.e come up with some sort of a wrapper) or just let him/her have it as it is? 4. Exon annotations: this is where things get a bit more complicated: a. First, get the stable Ensembl transcript ID corresponding to the provided stable Ensembl exon ID. This requires a three-table join but since it's still just a one-to-one mapping, MapView could be used (e.g. from Ensembl's exon_stable_id to UCSC's ensGene); b. Next, extract exon blobs from the relevant row of ensGene and parse them so that information pertaining to individual exons can be accessed; c. Afterwards, query the Ensembl database again but in the opposite direction - get all stable Ensembl exon IDs for the relevant transcript, preserving rank information somehow (either explicitly or via appropriate ordering). This would most likely require a GraphView from ensGene to exon_stable_id; d. Finally, assemble exon tuples from data from both databases and put them to a local dictionary. You can now use the original stable Ensembl exon ID provided by the user to query that local dictionary, moreover it's trivial to implement caching of already-assembled exons. That's it. Please let me know what you think. Cheers, -- MS -- You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-...@googlegroups.com. To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en.