Lucene outperforms MySQL, BerkeleyDB, and PostgreSQL for genome map database searches.
GBrowse (Generic Genome Browser, http://www.gmod.org/) is a widely used program for displaying maps of genome data in biology and bioinformatics. One need it serves is helping biologists quickly and easily locate features of interest among tens of millions of genome features for an organism.

Lucene, and the Lucegene project built on it, are well suited to rapid and easy searching of the complex, diverse and large volume of genome data. They are useful for searching genome sequences, literature and experimental data, interactions among genes, and other categories of genome information. Lucegene leverages the speed, high-volume capability and data-source adaptability of Lucene for searching multi-gigabyte bioinformatics databases.

Lucene is focused more on text searches and less on numerics, the opposite of relational databases, but it is also capable at numeric searches. One demanding case is genome maps: quickly showing biologists the locations of their favorite genes and other features, among millions of features spread across hundreds of millions of possible locations.

Time (seconds) for GBrowse web display, 30 iterations at different
map locations on fruitfly (dmel) genome
----------------------------------------------------------
                       Server3       Server2      Relative
GBrowse-Adaptor       Mean   SE     Mean   SE     time (ave.)
dmel_lucegene_500k     5.4  0.15    1.86  0.05    100
dmel_lucene_500k       6.1  0.13    2.23  0.05    117
dmel_mysql_500k        7.9  0.31    2.14  0.06    128
dmel_bdb_500k          8.3  0.53    4.10  0.32    187
dmel_chadofc_500k     25.9  0.91    9.86  0.77    510
----------------------------------------------------------

This uses a 500kb map range; differences increase with map range. All adaptors use the same data. Most of the response time is spent drawing maps, after features are extracted from the database. However, adaptor speed is one factor that can improve rapid displays.
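To illustrate what a numeric map range search over a text-oriented index involves, here is a minimal Python sketch. It is not Lucegene's algorithm or the Lucene API. It assumes only that a text index of this kind compares terms as strings, so coordinates are zero-padded to make string order match numeric order. The feature names, coordinates and the pad width are hypothetical.

```python
from bisect import bisect_right

PAD = 9  # assumed width: enough for coordinates below 1,000,000,000


def pad(pos):
    """Zero-pad a coordinate so that string order equals numeric order."""
    return str(pos).zfill(PAD)


# Hypothetical features on one chromosome arm: (name, start, end)
FEATURES = [
    ("geneD", 750_000, 760_000),
    ("geneA", 1_200, 4_800),
    ("geneC", 480_000, 510_000),
    ("geneB", 95_000, 130_000),
]


def build_index(features):
    """Sort features by padded start key, as a text index sorts its terms."""
    return sorted((pad(start), pad(end), name) for name, start, end in features)


def overlapping(index, win_start, win_end):
    """Return names of features that overlap the window [win_start, win_end]."""
    starts = [row[0] for row in index]
    # Binary search: only features with start <= win_end can overlap.
    hi = bisect_right(starts, pad(win_end))
    lo_key = pad(win_start)
    # Linear filter on the remaining candidates: keep those with end >= win_start.
    return [name for _, end, name in index[:hi] if end >= lo_key]


index = build_index(FEATURES)
hits = overlapping(index, 100_000, 600_000)  # a 500kb window
print(hits)
```

The binary search narrows candidates by start position, but the filter on end position is still linear in the candidates. That cost grows with the map range, which is one reason an adaptor tuned for range searches can pay off.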
There are slight differences in displays due to configurations and how each adaptor works, but no significant differences in the data returned by the adaptors. Lucene and MySQL indices are shared across platforms here. BerkeleyDB and Postgres indices cannot be, and had to be regenerated for each server. Server2 is x64-Solaris-10 (yr2005), Server3 is ppc-MacOSX-10.3 (yr2004).

The fastest adaptor here, Lucegene, has algorithms tuned for genome map range searches. The simple lucene adaptor is directly comparable in operation to the mysql and berkeleydb adaptors, using Lucene as persistent searchable data storage without Lucene-optimized functions.

Apart from the slow Chado Postgres adaptor, the speed differences in these results are not dramatic, but they add to the other advantages of this cross-platform, Java-based system, even when it is combined with Perl-based tools such as GBrowse. One important but difficult to measure factor is the cost of management, where genome data are frequently updated from diverse sources. Installing Lucene for this use is a simple matter of adding the Java library to map software. Lucene databases are easy to create from source data, and can be copied and shared across computer systems, whereas compiled software and binary databases usually need to be regenerated by informaticians.

GBrowse Perl Adaptor key:
  lucegene - lucegene.pm GFF (Lucene v1.9; Java 1.4/1.5)
  lucene   - simple lucene.pm GFF (Lucene v1.9; Java 1.4/1.5)
  bdb      - berkeleydb.pm GFF (BerkeleyDB v4.2)
  mysql    - mysqlopt.pm GFF (MySQL v4.0x)
  chadofc  - chado.pm DAS, modified for flybase Chado db (Postgres v7 & 8)

These are available through GMOD projects for use with GBrowse.

Preliminary tests suggest that Lucene may outperform Lion Bioscience's SRS at basic bio-databank search and retrieval, such as with the Uniprot database.
See also
  http://sourceforge.net/mailarchive/forum.php?thread_id=8094404&forum_id=31947
  http://www.gmod.org/
  http://www.gmod.org/lucegene/
  http://lucene.apache.org/

The archive at ftp://ftp.eugenes.org/eugenes/gbrowse/ has a set of Lucene indices of genomes for Worm, Yeast, Rice, and 9 Fruitfly species, along with GBrowse configuration files. You should be able to copy these, add the Lucene-lite and Lucegene adaptors to GBrowse, and display the genomes from your favorite server computer.

Example servers with these data and comparisons to other GBrowse adaptors (Chado-Pg, MySQL, BerkeleyDB) are here:
  http://server2.eugenes.org/gbrowse/ (Sun-Solaris-x64)
  http://server3.eugenes.org/gbrowse/ (Apple-MacOSX-ppc)

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED] -- http://marmot.bio.indiana.edu/