[pygr] Re: Support for Ensembl data in UCSC

Christopher Lee Wed, 11 Nov 2009 15:49:03 -0800


On Nov 11, 2009, at 2:37 PM, Marek Szuba wrote:

>
> On Wed, 11 Nov 2009 12:34:00 -0800
> "C. Titus Brown" <c...@msu.edu> wrote:
>
>>> 1. Can we live without original Ensembl exon IDs?
>> I can ;).  But I don't understand why we would need to.
> Well, if we absolutely and positively NEED Ensembl exon IDs then the
> whole idea of interfacing with Ensembl via UCSC is useless and we need
> to get back to talking directly to Ensembl.

I think you're expressing this too pessimistically.  At a minimum, the  
UCSC exon annotation should make it dramatically easier to join  
against the Ensembl exon annotations.  In 90% of cases exon  
annotations from these two sources will be uniquely mappable to each  
other simply by matching their sequences; the remaining ambiguities  
should be resolvable by the context of their neighboring exons.  In  
other words, let's consider a hierarchy of three approaches ordered by  
increasing amount of work:

1. UCSC supplies Ensembl exon IDs, so we're done.

2. We run an automated JOIN process that builds a mapping of UCSC exon  
annotations to Ensembl exon IDs.  If there is a tiny fraction that  
cannot be mapped, that is not a big problem.  We make this mapping  
available in Worldbase.

3. We give up on UCSC altogether and revert to trying to figure out  
how to map Ensembl coordinates to UCSC genome coordinates.  But we've  
been trying to get that information for 2-3 years now with no success.

If option #1 is out, then let's consider option #2.
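
As a rough illustration of what the JOIN in option #2 could look like, here is a minimal sketch that matches exons purely by sequence. Everything in it is an assumption for illustration: the function name, the plain dicts standing in for the real UCSC and Ensembl annotation sources, and the example keys and IDs are all hypothetical. It only separates unique matches from ambiguous and unmatched ones; resolving the ambiguous cases by neighboring-exon context is not shown.

```python
from collections import defaultdict


def join_exons_by_sequence(ucsc_exons, ensembl_exons):
    """Map UCSC exon keys to Ensembl exon IDs when the sequence match is unique.

    Both arguments are plain dicts of {exon key: sequence string}; they are
    hypothetical stand-ins for the real annotation sources.
    Returns (mapping, ambiguous, unmapped).
    """
    # Index Ensembl exons by (case-normalized) sequence.
    by_seq = defaultdict(list)
    for ens_id, seq in ensembl_exons.items():
        by_seq[seq.upper()].append(ens_id)

    mapping = {}     # UCSC key -> single Ensembl exon ID
    ambiguous = {}   # UCSC key -> several candidate Ensembl exon IDs
    unmapped = []    # UCSC keys with no sequence match at all
    for ucsc_id, seq in ucsc_exons.items():
        hits = by_seq.get(seq.upper(), [])
        if len(hits) == 1:
            mapping[ucsc_id] = hits[0]
        elif hits:
            # Left for a second pass, e.g. using neighboring exons.
            ambiguous[ucsc_id] = sorted(hits)
        else:
            unmapped.append(ucsc_id)
    return mapping, ambiguous, unmapped
```

Note that this checks uniqueness in one direction only (several UCSC exons with identical sequence would all map to the same Ensembl ID); a real join would probably want to flag that case as ambiguous too.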

>
>> Yes, it should be easy to do.  I'm not entirely sure what the best
>> mechanism will be though; is the problem that individual exon info
>> will have to be extracted from the blob dynamically?
> Pretty much...

This is no problem.  We just need to decide on a scheme for assigning  
each UCSC exon a unique ID.  I guess I'd advocate just using a string  
consisting of chromosome ID + start + stop.  E.g. something like  
"1.10000:10150"
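
To make the proposed ID scheme concrete, here is a small sketch of building and parsing such a string. The helper names are hypothetical, and splitting on the last "." is my own assumption, made so that a chromosome ID which itself contains a "." would still parse.

```python
def make_exon_id(chrom, start, stop):
    """Build an exon ID of the form 'chrom.start:stop'."""
    return "%s.%d:%d" % (chrom, start, stop)


def parse_exon_id(exon_id):
    """Split an exon ID back into (chrom, start, stop)."""
    # Split on the LAST '.' so chromosome IDs containing '.' survive.
    chrom, span = exon_id.rsplit(".", 1)
    start, stop = span.split(":")
    return chrom, int(start), int(stop)
```

The point of the scheme is that the ID carries all the information needed to locate the exon, so no separate lookup table is required.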

>
>> I could imagine an ensGene wrapper object with an exons list-like
>> object that in turn dynamically pulls exon information out of the
>> blobs, e.g. code like this
> [...]
>> Is this the sort of thing we need?
> Yup, that's it - possibly accompanied by caching of already-located
> exons.

Sure, no problem.  Using the exon ID scheme proposed above, this  
becomes utterly trivial -- a class whose __getitem__() just echoes  
back the chromosome ID, start, stop for the annotation DB to look up  
from the genome db...
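
A minimal sketch of such a class, with the caching Marek mentioned, might look like the following. The class name, the namedtuple, and its field names (id, start, stop) are assumptions for illustration only; how this would actually be wired into an annotation DB is not shown here.

```python
from collections import namedtuple

# Hypothetical record type: 'id' is the chromosome ID that the annotation
# DB would look up in the genome db, plus the exon's start and stop.
ExonSlice = namedtuple("ExonSlice", ["id", "start", "stop"])


class ExonKeyEcho(object):
    """Dict-like object whose keys are 'chrom.start:stop' exon IDs.

    Nothing is stored up front: __getitem__ just parses the key and
    echoes back its parts, caching records it has already built.
    """

    def __init__(self):
        self._cache = {}

    def __getitem__(self, exon_id):
        try:
            return self._cache[exon_id]
        except KeyError:
            pass
        # Parse 'chrom.start:stop'; split on the last '.' (an assumption).
        chrom, span = exon_id.rsplit(".", 1)
        start, stop = span.split(":")
        record = ExonSlice(chrom, int(start), int(stop))
        self._cache[exon_id] = record
        return record
```

For example, ExonKeyEcho()["1.10000:10150"] gives back a record with id "1", start 10000 and stop 10150.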

-- Chris 
