Hi, I added a simple method for accessing any UCSC annotation database, in my sql_rekey branch. You just specify the table name, and it gives you an AnnotationDB. Example code:
>>> from pygr.apps.ucsc_ensembl_annot import UCSCEnsemblInterface >>> factory = UCSCEnsemblInterface('Bio.Seq.Genome.HUMAN.hg18') # access hg18 >>> UCSC database >>> snp130 = factory.get_annot_db('snp130') # request snp130 table >>> snp = list(snp130.query('WHERE name=%s', ('rs58108140',)))[0] >>> print 'SNP:',snp.name, repr(snp.sequence), snp.refUCSC, snp.observed SNP: rs58108140 chr1[582:583] G A/G You can also search using a specific genomic region: >>> ival = factory.genome_seq['chr1'][10000:11000] >>> snps = list(snp130.query_interval(ival)) >>> print 'query:', repr(ival), len(snps), 'snps' query: chr1[10000:11000] 73 snps >>> snp = snps[0] >>> print 'SNP:',snp.id, repr(snp.sequence), snp.refUCSC, snp.observed SNP: chr1.10003.rs12354060 chr1[10003:10004] G A/G The main challenge here was that almost all of the UCSC tables lack a primary key. That means there is no way to uniquely designate a specific row. This forced me to develop a new SQLTable class, SQLTableRekeyed, which lets you specify multiple columns that should be treated as the "identifier" for each row. This necessitated new code for both querying and iteration using the new virtual identifiers. Fortunately this is confined to the new class. Above you can see that I made the default for the UCSC tables combine the chrom, chromStart and name columns to create a unique identifier. I pushed the code for this to a new branch, sql_rekey; take a look. Should we include this new feature in our upcoming release's support for UCSC-Ensembl datasets? The opposing arguments: - PRO: existing databases often have lousy schemas, so it is very convenient for Pygr to be able to work with these schemas through automatic features like this (generating a unique key when the table fails to provide one). As the code above shows, we can give users a fairly easy way to access these datasets, which would otherwise be impossible (Pygr databases, like any Python dictionary, require unique keys). - CON: this makes the Pygr code more complex. So much of the real pain in developing and supporting Pygr's database features arises from trying to rescue users from external services that don't work correctly (e.g. database tables that lack a primary key; MySQL's hideous iteration performance). By developing all these workarounds, we may be letting the ugliness of those broken external services make the Pygr code more complex than it would otherwise have to be. - CON2: the lack of a real primary key also slows performance on these tables. That is, Pygr will send the MySQL server correct ORDER BY queries for listing these columns, but because the server doesn't necessarily index those columns, iteration performance seems poor. ALTERNATIVES? The only real alternatives I can think of are: - lobby UCSC to add a primary key to their tables! I sent this question to their developers but haven't heard anything back yet. How likely is it that they'll reindex all their tables just for us? - port all the UCSC tables that we want to support to our own MySQL server and add primary keys to them. Sounds like an enormous task, given the huge numbers of tracks in the UCSC database. It also requires pointing users at our MySQL server instead of UCSC's; this might require setting up new servers and storage. There are huge numbers of different tracks in the ucsc database server. Since most of these follow the same column name conventions (chrom, chromStart, chromEnd, strand) they should work exactly as shown above. In cases where the column names are different, just specify the correct column names, e.g. >>> knownGenes = factory.get_annot_db('knownGenes', >>> sliceAttrDict=dict(id='chrom', start='txStart', stop='txEnd')) Please take a look and give me your comments. -- Chris -- You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-...@googlegroups.com. To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en.