Hi Daniel,

Looks like there is a difficult to reproduce bug with the interaction of
exact structure search and the index.
I will try to track it down.

In the meantime, it's probably good to be aware that the @= operator can,
under the best of circumstances, behave somewhat oddly. This is a
consequence of the less-than-satisfactory way the RDKit current implements
molecular equality. Some examples are below. If you want to search for
identical molecules, I'd suggest adding a SMILES column to your database
built using the mol_to_smiles(m) function, build an index on that, and
query with molecules that have been run through mol_to_smiles(m). This is a
bit painful and adds an extra computation step, but it's more likely to
generate a correct result than relying on the molecular equality function.

Here are the examples:

contrib_regression=# select 'C[17OH]'::mol @= 'CO'::mol ;

 ?column?

----------

 f

(1 row)


contrib_regression=# select 'C[17OH]'::mol @= 'C[17OH]'::mol ;

 ?column?

----------

 t

(1 row)


contrib_regression=# select 'CO'::mol @= 'C[17OH]'::mol ;

 ?column?

----------

 f

(1 row)

On Fri, Oct 31, 2014 at 2:10 PM, Daniel Moser <
mo...@pharmchem.uni-frankfurt.de> wrote:

>  Hi all,
>
>
>
> since the new release I‘m experiencing problems with exact structure
> search in the cartridge. If an index is defined on the mol column exact
> structure search ( @= ) doesn’t work (i.e. yields no results). I tried it
> with rdkit compiled from source under CentOS 6.5 and with the RPMs from
> Gianluca Sforna for Fedora 20. In both cases postgres 9.3 was used. Can
> anyone confirm this or am I missing something?
>
>
>
> Here’s what I’ve done (based on the emolecules example from the docs):
>
>
>
> RDKit 2014.09 (not working):
>
> #####################
>
> [moe@localhost db]$ createdb emolecules
>
> [moe@localhost db]$ psql -c 'create extension rdkit' emolecules
>
> CREATE EXTENSION
>
> [moe@localhost db]$ psql -c 'SELECT rdkit_version()' emolecules
>
> rdkit_version
>
> ---------------
>
> 0.73.0
>
> (1 row)
>
>
>
> [moe@localhost db]$ wget
> http://downloads.emolecules.com/free/2014-10-01/version.smi.gz
>
> [...]
>
> 2014-10-31 10:52:29 (1,08 MB/s) - ‘version.smi.gz’ saved
> [88871202/88871202]
>
>
>
> [moe@localhost db]$ psql -c 'create table raw_data (id SERIAL, smiles
> text, emol_id integer, parent_id integer)' emolecules
>
> CREATE TABLE
>
> [moe@localhost db]$ zcat version.smi.gz | sed '1d; s/\\/\\\\/g' | psql -c
> "copy raw_data (smiles,emol_id,parent_id) from stdin with delimiter ' '"
> emolecules
>
> [moe@localhost db]$ psql emolecules
>
> psql (9.3.5)
>
> Type "help" for help.
>
>
>
> emolecules=# SELECT * INTO mols FROM (SELECT
> id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT
> null LIMIT 1000;
>
> SELECT 1000
>
> emolecules=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
>
> id  |     m
>
> -----+------------
>
> 383 | CCOC(=N)CC
>
> (1 row)
>
>
>
> emolecules=# CREATE INDEX molidx ON mols USING gist(m);
>
> CREATE INDEX
>
> emolecules=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
>
> id | m
>
> ----+---
>
> (0 rows)
>
>
>
> emolecules=# DROP INDEX molidx; DROP TABLE mols; SELECT * INTO mols FROM
> (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS
> NOT null LIMIT 1000;  SELECT * FROM mols WHERE m @=
> Mol_From_Smiles('CCOC(=N)CC');
>
> DROP INDEX
>
> DROP TABLE
>
> SELECT 1000
>
> id  |     m
>
> -----+------------
>
> 383 | CCOC(=N)CC
>
> (1 row)
>
> #####################
>
>
>
>
>
> RDKit 2014.03 (working):
>
> #####################
>
> -bash-4.1$ createdb emolecules_test
>
> -bash-4.1$ psql -c 'SELECT rdkit_version()' emolecules_test
>
> rdkit_version
>
> ---------------
>
> 0.72.0
>
> (1 row)
>
>
>
> -bash-4.1$ psql -c 'create table raw_data (id SERIAL, smiles text, emol_id
> integer, parent_id integer)' emolecules_test
>
> CREATE TABLE
>
> -bash-4.1$ zcat version.smi.gz | sed '1d; s/\\/\\\\/g' | psql -c "copy
> raw_data (smiles,emol_id,parent_id) from stdin with delimiter ' '"
> emolecules_test
>
> -bash-4.1$ psql emolecules_test
>
> psql (9.3.4)
>
> Type "help" for help.
>
>
>
> emolecules_test=# SELECT * INTO mols FROM (SELECT
> id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT
> null LIMIT 1000;
>
> SELECT 1000
>
> emolecules_test=# SELECT * FROM mols WHERE m @=
> Mol_From_Smiles('CCOC(=N)CC');
>
> id  |     m
>
> -----+------------
>
> 383 | CCOC(=N)CC
>
> (1 row)
>
>
>
> emolecules_test=# CREATE INDEX molidx ON mols USING gist(m);
>
> CREATE INDEX
>
> emolecules_test=# SELECT * FROM mols WHERE m @=
> Mol_From_Smiles('CCOC(=N)CC');
>
> id  |     m
>
> -----+------------
>
> 383 | CCOC(=N)CC
>
> (1 row)
>
>
>
> emolecules_test=# DROP INDEX molidx; DROP TABLE mols; SELECT * INTO mols
> FROM (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE
> m IS NOT null LIMIT 1000;  SELECT * FROM mols WHERE m @=
> Mol_From_Smiles('CCOC(=N)CC');
>
> DROP INDEX
>
> DROP TABLE
>
> SELECT 1000
>
> id  |     m
>
> -----+------------
>
> 383 | CCOC(=N)CC
>
> (1 row)
>
> #####################
>
>
>
>
>
>
>
> Best,
>
> Daniel
>
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to