Database searching using fingerprints seems to be a topic of interest
at the moment...

Sent to you by baoilleach via Google Reader: Structure Search Engine
for All Major RDBMSs via ChemHack by Duan Lian on 2/26/09




Many methods of doing substructure search directly in SQL has been
reported recently, Adel Golovin and Kim Henrick’s Chemical Substructure
in SQL, Rich Apodaca’s fingerprint based MySQL substructure search in
MySQL, and Charlie Zhu’s Microsoft SQL Server based substructure search
with SMARTS support.

Doing this in RDBMSs do have a number of advantages, “including
platform independency, simplicity, flexibility, integrity, robustness
and single point of failure”, as Adel and Kim describes. But some light
weight RDBMSs such as MySQL and PostgreSQL, the most widely used open
source ones, provide very limited SQL programming function, a pure SQL
based solution may be impossible.

Plugins are developed to enhance the functionality. For MySQL, there’s
an open source project called mychem. For PostgreSQL, there’s
pgchem:tigress which is also open source. Both of them is based on
OpenBabel, a C++ chemoinformatics library.

On Oracle platform, there’re CambridgeSoft Oracle Cartridge, Symyx
Direct, JChem Cartridge, etc.. As Oracle is a commercial platform, not
of these above is free.

When I was developing chemsoso.com, a Chinese chemical supplier
database, structure search feature is an important problem to be
solved. The database contains 90,000 different chemicals in total and
still growing, performance needs to be carefully dealt with.

In consideration of speed, fingerprint is obviously the best choice. It
takes time to generate fingerprints, but in the search stage, bit
operation are much less consuming than graph matching. My initial idea
is to generate fingerprint in Java and do bit operation in MySQL.
Unfortunately, MySQL has restrictions on bit operation, it limit the
maximum range to 64 bits. In Rich’s solution, fingerprint is separated
into multi fileds to satisfy MySQL’s requirement. Substructure search
is possible in this method. But similar search where Tanimoto
coefficient needs to be calculated is still impossible, as more bit
operation function is missing in MySQL.

In my final solution, a in-memory fingerprint index outside MySQL is
created. Molecule structure information(SMILES or mol file) is stored
in MySQL, my search engine synchronize data between the in-memory index
and MySQL table. Structure searching is performed directly on the
in-memory index, this guarantee the performance. On a MacBook with
1.83GHz CPU, it only takes about 50ms to do substructure search on
chemsoso.com’s 90,000 structures. For similar structure search, it
takes about 300ms, for full structure search, the time is less than
10ms. If search boundary if set according to similarity requirement, we
can have another 4X to 100X performance improvement depends on the
complexity of the query molecule structure.

Several days ago, Charlie Zhu talked to me, wondering how
chemsoso.com’s structure search engine works. As the structure was
mainly built from open source libraries, I decide to make the search
engine open source to ease the work of building chemistry related
database. With the search engine released, developers can focus the
database’s own functionality, instead of dealing with structure search.


Before I can release the search engine, I have to find a way to cut in
users’ system in the form of plugin. Including source code directly
into users’ project may be the fastest way to add structure search
functionality, but my code is in Java does not means everyone’s project
is in Java. I want the search engine works not only with all major
RDBMSs, but also all major OSs and all major programming languages.
Besides Java API, command API and HTTP API will also be provided to
make sure the search engine works with multi programming language and
network environment where server clusters exists.

Things you can do from here:
- Subscribe to ChemHack using Google Reader
- Get started using Google Reader to easily keep up with all your
favorite sites

Reply via email to