Re: [Rdkit-discuss] Structure Search Engine for All Major RDBMSs

Igor Filippov [Contr] Thu, 26 Feb 2009 16:33:00 +0000

Interesting, but 90,000 structures is a tiny database by today's
standards. For CSLS - http://cactus.nci.nih.gov/cgi-bin/lookup/search - we
have to deal with 46 million unique structures, so an in-memory
fingerprint index might well run out of memory. Another point worth
mentioning is that similarity search is not the same as substructure
search, and it seems like Duan is equating the two.
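
To make both points concrete, here is a minimal Python sketch. It assumes
fingerprints are held as plain Python integers and, for the memory
estimate, that each fingerprint is 1024 bits wide; neither detail comes
from Duan's post, and the function names are made up for illustration.

```python
def passes_substructure_screen(query_fp, target_fp):
    """Screen only: every bit set in the query must also be set in the target."""
    return (query_fp & target_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either one."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(fp_a & fp_b).count("1") / union


def raw_index_bytes(n_structures, fp_bits):
    """Size of the raw fingerprint bits only, ignoring any per-object overhead."""
    return n_structures * fp_bits // 8


query = 0b0000000000001111   # small query fragment
target = 0b1111111111111111  # much larger molecule containing all query bits

print(passes_substructure_screen(query, target))  # screen passes
print(tanimoto(query, target))                    # yet similarity is low
print(raw_index_bytes(90_000, 1024))              # assumed 1024-bit fingerprints
print(raw_index_bytes(46_000_000, 1024))
```

A target can pass the substructure screen and still score poorly on
similarity, and under the 1024-bit assumption the raw bits alone grow from
roughly 11.5 MB for 90,000 structures to roughly 5.9 GB for 46 million.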


Igor

On Thu, 2009-02-26 at 15:47 +0000, baoilleach wrote:
> Database searching using fingerprints seems to be a topic of interest
> at the moment...
> 
> Structure Search Engine for All Major RDBMSs
> via ChemHack by Duan Lian on 2/26/09
> 
>  
> 
> FingerPrint
> 
> Many methods of doing substructure search directly in SQL have been
> reported recently: Adel Golovin and Kim Henrick’s Chemical
> Substructure in SQL, Rich Apodaca’s fingerprint-based substructure
> search in MySQL, and Charlie Zhu’s Microsoft SQL Server based
> substructure search with SMARTS support. 
> 
> Doing this in an RDBMS does have a number of advantages, “including
> platform independency, simplicity, flexibility, integrity, robustness
> and single point of failure”, as Adel and Kim describe. But some
> lightweight RDBMSs such as MySQL and PostgreSQL, the most widely used
> open source ones, provide very limited SQL programming functions, so
> a pure SQL solution may be impossible. 
> 
> Plugins have been developed to extend the functionality. For MySQL,
> there’s an open source project called mychem. For PostgreSQL, there’s
> pgchem:tigress, which is also open source. Both of them are based on
> OpenBabel, a C++ chemoinformatics library. 
> 
> On the Oracle platform, there are the CambridgeSoft Oracle Cartridge,
> Symyx Direct, JChem Cartridge, etc. As Oracle is a commercial
> platform, none of the above is free.
> 
> When I was developing chemsoso.com, a Chinese chemical supplier
> database, structure search was an important problem to be solved.
> The database contains 90,000 different chemicals in total and is
> still growing, so performance needs to be handled carefully. 
> 
> For speed, fingerprints are obviously the best choice. It takes time
> to generate fingerprints, but in the search stage bit operations are
> much cheaper than graph matching. My initial idea was to generate
> fingerprints in Java and do the bit operations in MySQL.
> Unfortunately, MySQL restricts bit operations to a maximum width of
> 64 bits. In Rich’s solution, the fingerprint is split into multiple
> fields to satisfy MySQL’s requirement. Substructure search is
> possible with this method. But similarity search, where the Tanimoto
> coefficient needs to be calculated, is still impossible, because
> MySQL lacks the additional bit operation functions it needs.
> 
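
The multi-field idea described above can be sketched roughly as follows.
This is Python rather than SQL, the helper names are hypothetical, and it
is not Rich's actual schema: it only shows a fingerprint being cut into
64-bit words and the substructure screen being applied one word at a time.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def split_words(fp, n_words):
    """Split an integer fingerprint into n_words 64-bit words, lowest bits first."""
    return [(fp >> (WORD_BITS * i)) & WORD_MASK for i in range(n_words)]


def screen_by_words(query_words, target_words):
    """Word-wise screen: each query word must be fully covered by the target word."""
    return all((q & t) == q for q, t in zip(query_words, target_words))


# A 128-bit toy fingerprint with bit 0 and bit 100 set, stored as two "fields".
target_words = split_words((1 << 100) | 1, 2)

print(screen_by_words(split_words(1 << 100, 2), target_words))  # query bit present
print(screen_by_words(split_words(1 << 70, 2), target_words))   # query bit absent
```

Each word-wise comparison would correspond to one AND condition over a
64-bit column, which is why the screen survives the 64-bit limit.
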
> In my final solution, an in-memory fingerprint index is created
> outside MySQL. Molecule structure information (SMILES or mol file) is
> stored in MySQL, and my search engine synchronizes data between the
> in-memory index and the MySQL table. Structure searching is performed
> directly on the in-memory index, which guarantees the performance. On
> a MacBook with a 1.83GHz CPU, it takes only about 50ms to do a
> substructure search on chemsoso.com’s 90,000 structures. A similar
> structure search takes about 300ms, and a full structure search takes
> less than 10ms. If a search boundary is set according to the
> similarity requirement, we can get another 4X to 100X performance
> improvement, depending on the complexity of the query molecule
> structure.
> 
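
One plausible reading of the "search boundary" mentioned above is a
bit-count bound: the Tanimoto coefficient of two fingerprints can never
exceed the smaller bit count divided by the larger one, so many candidates
can be skipped without being compared. The post does not say this is what
the engine does; the sketch below is only an assumption, with made-up names
and a toy index of Python integers.

```python
def popcount(fp):
    """Number of bits set in an integer fingerprint."""
    return bin(fp).count("1")


def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either one."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def similarity_search(query_fp, index, threshold):
    """index: list of (mol_id, fingerprint, precomputed bit count) tuples."""
    q_bits = popcount(query_fp)
    hits = []
    for mol_id, fp, bits in index:
        larger = max(q_bits, bits)
        # Upper bound on Tanimoto from bit counts alone; skip hopeless candidates.
        if larger == 0 or min(q_bits, bits) / larger < threshold:
            continue
        score = tanimoto(query_fp, fp)
        if score >= threshold:
            hits.append((mol_id, score))
    return hits


toy_fps = {"a": 0b1111, "b": 0b0001, "c": 0b0111}
toy_index = [(mol_id, fp, popcount(fp)) for mol_id, fp in toy_fps.items()]

print(similarity_search(0b1111, toy_index, 0.7))
```

How much this saves depends on the threshold and on how the bit counts in
the database are distributed relative to the query.
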
> Several days ago, Charlie Zhu talked to me, wondering how
> chemsoso.com’s structure search engine works. As it was mainly built
> from open source libraries, I decided to make the search engine open
> source to ease the work of building chemistry-related databases.
> With the search engine released, developers can focus on the
> database’s own functionality instead of dealing with structure
> search.
> 
> Engine Structure
> 
> Before I can release the search engine, I have to find a way to plug
> it into users’ systems. Including the source code directly in users’
> projects may be the fastest way to add structure search
> functionality, but my code being in Java does not mean everyone’s
> project is in Java. I want the search engine to work not only with
> all major RDBMSs, but also with all major OSs and all major
> programming languages. Besides the Java API, a command-line API and
> an HTTP API will also be provided, to make sure the search engine
> works with multiple programming languages and in network
> environments where server clusters exist. 
> 
