Re: [pygr] Re: Support for Ensembl data in UCSC

Marek Szuba Tue, 17 Nov 2009 17:51:47 -0800

Hello everyone,

I have just pushed the first version of Ensembl-in-UCSC code to my
GitHub repository ('ucsc_ensembl' branch):


http://github.com/mkszuba/pygr/commit/193d2d24a472d2deb1090128168c057775185164

Its status is as follows:

1. Transcript and gene AnnotationDBs can be created and used without
problems, talking directly to the UCSC MySQL server;

2. Exon annotations are successfully extracted from transcripts they
are embedded in and used to create a working exon AnnotationDB, with
exon identifier format of 'transcript_id:rank'. This of course requires
parsing of all Ensembl transcripts in the UCSC database, which takes
about 12 minutes; while it is not a problem if we end up packing these
databases into NLMSAs and storing them in worldbase, allowing users to
talk to UCSC directly would likely require a redesign which would e.g.
only fetch and parse transcripts on demand;

3. Fetching original exon IDs from Ensembl does NOT work yet: while all
that would require is a single SELECT query joining three tables, for
reasons unclear to me I get a 'no database selected' error (to make the
error appear, remove the try..except block from lines 27-31);

4. There is no support for protein annotations yet. In principle this
is trivial to achieve - just use the table ensGtp to translate protein
ID to transcript ID, then return appropriate transcript data - but I do
not know how to attach SQLTable to a join result instead of a real
table (NB. for the same reason I used cursor.execute trying to get
Ensembl exon ID), and building this AnnotationDB on the client side
(à la the one for exons) feels rather wasteful;

5. The list of Ensembl database names for different versions of their
data is hardcoded: the names in question follow the format
'homo_sapiens_core_XX_YYY' and while 'XX', the actual version number,
is trivial to obtain from UCSC, 'YYY' is not. Not sure how we should
proceed here.
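
To illustrate the exon identifier format from point 2, here is a minimal
sketch of how a single transcript row could be split into exons. It is not
the code from the commit: the function name is hypothetical, the input is
assumed to look like the comma-separated exonStarts/exonEnds strings of a
UCSC genePred-style table, and ranks are assumed to be counted in the
direction of transcription (so reversed for the minus strand).

```python
def exons_from_transcript(transcript_id, strand, exon_starts, exon_ends):
    """Yield (exon_id, start, end) for every exon of one transcript.

    Hypothetical helper, for illustration only. exon_starts and exon_ends
    are assumed to be comma-separated strings with a trailing comma,
    e.g. '100,300,500,'.
    """
    # Drop the empty item left behind by the trailing comma.
    starts = [int(s) for s in exon_starts.split(',') if s]
    ends = [int(e) for e in exon_ends.split(',') if e]
    if len(starts) != len(ends):
        raise ValueError('exon start/end counts differ for %s' % transcript_id)

    coords = list(zip(starts, ends))
    # Assumption: rank 1 is the first exon in transcription order, so
    # minus-strand transcripts are numbered from the highest coordinate.
    if strand == '-':
        coords.reverse()

    for rank, (start, end) in enumerate(coords, 1):
        # Identifier format described above: 'transcript_id:rank'
        yield '%s:%d' % (transcript_id, rank), start, end


# Made-up transcript ID and coordinates, only to show the call.
for exon in exons_from_transcript('ENST00000000001', '-',
                                  '100,300,500,', '200,400,600,'):
    print(exon)
```

Because it is a generator working on one row at a time, something along
these lines could also be called lazily for a single transcript, which is
the kind of on-demand parsing mentioned in point 2.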

Cheers,
-- 
MS
