Re: [pygr] Ensembl-UCSC interface

Christopher Lee Sun, 07 Feb 2010 13:51:09 -0800

Hi,
I added a simple method for accessing any UCSC annotation database, in my 
sql_rekey branch.  You just specify the table name, and it gives you an 
AnnotationDB.  Example code:


>>> from pygr.apps.ucsc_ensembl_annot import UCSCEnsemblInterface
>>> factory = UCSCEnsemblInterface('Bio.Seq.Genome.HUMAN.hg18') # access hg18 UCSC database
>>> snp130 = factory.get_annot_db('snp130') # request snp130 table
>>> snp = list(snp130.query('WHERE name=%s', ('rs58108140',)))[0]
>>> print 'SNP:',snp.name, repr(snp.sequence), snp.refUCSC, snp.observed
SNP: rs58108140 chr1[582:583] G A/G

You can also search using a specific genomic region:

>>> ival = factory.genome_seq['chr1'][10000:11000]
>>> snps = list(snp130.query_interval(ival))
>>> print 'query:', repr(ival), len(snps), 'snps'
query: chr1[10000:11000] 73 snps
>>> snp = snps[0]
>>> print 'SNP:',snp.id, repr(snp.sequence), snp.refUCSC, snp.observed
SNP: chr1.10003.rs12354060 chr1[10003:10004] G A/G
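
For readers wondering what a region search boils down to on the SQL side, here is a minimal sketch of an overlap test. It is not the code in the branch: overlap_where is a hypothetical helper, and it assumes the usual UCSC conventions (0-based, half-open coordinates in the chrom, chromStart and chromEnd columns).

```python
def overlap_where(chrom, start, stop, chrom_col='chrom',
                  start_col='chromStart', stop_col='chromEnd'):
    """Build a WHERE clause for rows overlapping [start, stop) on chrom.

    Hypothetical helper; assumes 0-based, half-open coordinates.
    """
    # Two half-open intervals overlap when each one starts before the
    # other one ends.
    clause = 'WHERE %s=%%s AND %s<%%s AND %s>%%s' % (chrom_col, start_col,
                                                   stop_col)
    return clause, (chrom, stop, start)


clause, params = overlap_where('chr1', 10000, 11000)
# clause -> 'WHERE chrom=%s AND chromStart<%s AND chromEnd>%s'
# params -> ('chr1', 11000, 10000)
```

The clause and parameter tuple have the same shape as the arguments passed to query() in the first example.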

The main challenge here was that almost all of the UCSC tables lack a primary 
key.  That means there is no way to uniquely designate a specific row.  This 
forced me to develop a new SQLTable class, SQLTableRekeyed, which lets you 
specify multiple columns that should be treated as the "identifier" for each 
row.  This necessitated new code for both querying and iteration using the new 
virtual identifiers.  Fortunately this is confined to the new class.  Above you 
can see that I made the default for the UCSC tables combine the chrom, 
chromStart and name columns to create a unique identifier.  I pushed the code 
for this to a new branch, sql_rekey; take a look.
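
To make the "virtual identifier" idea concrete, here is a rough sketch of how several columns could be combined into one key and turned back into a query. This only illustrates the idea and is not the SQLTableRekeyed code itself: make_id and id_to_where are hypothetical names, and the '.' separator is inferred from the id printed above (chr1.10003.rs12354060).

```python
KEY_COLUMNS = ('chrom', 'chromStart', 'name')  # default combination described above


def make_id(row, key_columns=KEY_COLUMNS, sep='.'):
    """Join the key columns of a row (a dict) into one string identifier."""
    return sep.join(str(row[col]) for col in key_columns)


def id_to_where(row_id, key_columns=KEY_COLUMNS, sep='.'):
    """Split an identifier into column values and build a WHERE clause."""
    # Limiting the number of splits keeps separator characters that occur
    # inside the last column (e.g. a name containing '.') intact.
    values = row_id.split(sep, len(key_columns) - 1)
    clause = 'WHERE ' + ' AND '.join('%s=%%s' % col for col in key_columns)
    # Values come back as strings; the database driver converts them.
    return clause, tuple(values)


row = {'chrom': 'chr1', 'chromStart': 10003, 'name': 'rs12354060'}
row_id = make_id(row)                 # 'chr1.10003.rs12354060'
clause, params = id_to_where(row_id)
# clause -> 'WHERE chrom=%s AND chromStart=%s AND name=%s'
# params -> ('chr1', '10003', 'rs12354060')
```

Such a key is only unique if the chosen columns never repeat together, which is why the column list has to be configurable per table.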

Should we include this new feature in our upcoming release's support for 
UCSC-Ensembl datasets?  The arguments for and against:
- PRO: existing databases often have lousy schemas, so it is very convenient 
for Pygr to be able to work with these schemas through automatic features like 
this (generating a unique key when the table fails to provide one).  As the 
code above shows, we can give users a fairly easy way to access these datasets, 
which would otherwise be impossible (Pygr databases, like any Python 
dictionary, require unique keys).

- CON: this makes the Pygr code more complex.  So much of the real pain in 
developing and supporting Pygr's database features arises from trying to rescue 
users from external services that don't work correctly (e.g. database tables 
that lack a primary key; MySQL's hideous iteration performance).  By developing 
all these workarounds, we may be letting the ugliness of those broken external 
services make the Pygr code more complex than it would otherwise have to be.

- CON2: the lack of a real primary key also slows performance on these tables.  
Pygr sends the MySQL server correct ORDER BY queries over the key columns, but 
because the server does not necessarily index those columns, iteration 
performance seems poor.
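
To illustrate the iteration issue in CON2, here is a self-contained sketch of chunked iteration ordered by a multi-column key. It uses sqlite3 from the standard library as a stand-in for MySQL, a toy table with made-up rows, and a hypothetical function name (iter_by_composite_key); it is not the code in the branch. Every chunk re-runs the ORDER BY, so if no index covers those columns the server may have to sort on each request.

```python
import sqlite3


def iter_by_composite_key(conn, table, chunk=2):
    """Yield (chrom, chromStart, name) rows in key order, one chunk per query."""
    first_sql = ('SELECT chrom, chromStart, name FROM %s '
                 'ORDER BY chrom, chromStart, name LIMIT ?' % table)
    next_sql = ('SELECT chrom, chromStart, name FROM %s '
                'WHERE chrom > ? OR (chrom = ? AND chromStart > ?) '
                'OR (chrom = ? AND chromStart = ? AND name > ?) '
                'ORDER BY chrom, chromStart, name LIMIT ?' % table)
    rows = conn.execute(first_sql, (chunk,)).fetchall()
    while rows:
        for row in rows:
            yield row
        c, s, n = rows[-1]  # resume strictly after the last key seen
        rows = conn.execute(next_sql, (c, c, s, c, s, n, chunk)).fetchall()


conn = sqlite3.connect(':memory:')
# Toy table with no primary key, mimicking the UCSC column names.
conn.execute('CREATE TABLE snp (chrom TEXT, chromStart INTEGER, name TEXT)')
conn.executemany('INSERT INTO snp VALUES (?, ?, ?)', [
    ('chr1', 10003, 'rs2'), ('chr1', 582, 'rs1'), ('chr2', 5, 'rs4'),
    ('chr1', 10003, 'rs3'), ('chr1', 20000, 'rs5')])
rows = list(iter_by_composite_key(conn, 'snp'))
```

On a server we control, an index over (chrom, chromStart, name) would let each chunk be read in key order directly; on UCSC's server we have no way to add one.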

ALTERNATIVES?
The only real alternatives I can think of are:
- lobby UCSC to add a primary key to their tables!  I sent this question to 
their developers but haven't heard anything back yet.  How likely is it that 
they'll reindex all their tables just for us?

- port all the UCSC tables that we want to support to our own MySQL server and 
add primary keys to them.  Sounds like an enormous task, given the huge numbers 
of tracks in the UCSC database.  It also requires pointing users at our MySQL 
server instead of UCSC's; this might require setting up new servers and storage.

There are huge numbers of different tracks in the UCSC database server.  Since 
most of these follow the same column name conventions (chrom, chromStart, 
chromEnd, strand) they should work exactly as shown above.  In cases where the 
column names are different, just specify the correct column names, e.g.

>>> knownGenes = factory.get_annot_db('knownGenes',
...     sliceAttrDict=dict(id='chrom', start='txStart', stop='txEnd'))
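
As a rough picture of what the sliceAttrDict mapping does, the sketch below resolves the three slice attributes (id, start, stop) to whatever column names a table uses. The helper slice_coords, the default dictionary and the row values are all hypothetical, chosen only to mirror the column conventions mentioned above; this is not how Pygr implements it internally.

```python
# Hypothetical default, mirroring the common UCSC column names.
DEFAULT_SLICE_ATTRS = dict(id='chrom', start='chromStart', stop='chromEnd')


def slice_coords(row, slice_attr_dict=DEFAULT_SLICE_ATTRS):
    """Return (sequence id, start, stop) for a row given as a dict."""
    return tuple(row[slice_attr_dict[attr]] for attr in ('id', 'start', 'stop'))


# Toy rows, not real annotation data.
snp_row = {'chrom': 'chr1', 'chromStart': 582, 'chromEnd': 583}
gene_row = {'chrom': 'chr1', 'txStart': 1000, 'txEnd': 5000}

snp_coords = slice_coords(snp_row)
gene_coords = slice_coords(
    gene_row, dict(id='chrom', start='txStart', stop='txEnd'))
```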

Please take a look and give me your comments.

-- Chris
