Hi Akos

Very strange behaviour. I don't see anything wrong with your SQL syntax.
I've tried equivalent searches in my 2.6M compound database and they give
the expected results. I used iodine rather than gold, for which there are
19504 structures. Adding the qualifying SQL clauses singly and in
combination gives -
SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol;
 Ic1nc[nH]n1
 Ic1nnc[nH]1
 O=P(O)(CI)CI
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 ICI
 Cn1nnnc1I
(15 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6!H0]'::qmol;
 Nc1c(I)c(C(=O)O)c(I)c(C(=O)O)c1I
 O=C(O)c1c(I)c(I)c(I)c(I)c1C(=O)O
 Oc1nnc(I)c(O)n1
 O=c1[nH]c(Cl)c(I)c(=O)[nH]1
 O=c1[nH]nc(I)c(=O)[nH]1
 Fc1c(F)c(F)c(SC(I)=C(Sc2c(F)c(F)c(F)c(F)c2F)Sc2c(F)c(F)c(F)c(F)c2F)c(F)c1F
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 Nc1nc(N)c(I)c(O)n11
 NC(=O)c1nn[nH]c1I
(18 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol AND NOT structure @> '[#6!H0]'::qmol;
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
(10 rows)

This is with RDKit 2016-09 and postgresql-9.5. The returned hits are all
consistent with the additional qualifiers. BTW, the presence or absence of
parentheses around the clauses makes no difference, nor does replacing
[C,c] by [#6].

If the queries are running correctly (and not knowing the structure of your
database), is there any possibility of ambiguity in the cid values in the
cpds table? Do they have a primary key / uniqueness constraint? If any cids
are associated with multiple structures then the structure searching could
be working fine, but you might sometimes get spurious structures returned
from the cid lookup. You can check this with the query -
SELECT cid, COUNT(cid) FROM cpds GROUP BY cid HAVING COUNT(cid) > 1;
This will return any cid values which occur more than once in the cpds
table.

Failing that, the only time I've seen behaviour resembling your report was
with a database (using a different cartridge, in Oracle) where the
structure index had become corrupted. I don't know if that's even a
possibility with RDKit, but it might be worth considering an index rebuild
if everything else looks OK.

Best regards,
Chris


On 21 March 2017 at 06:34, Akos Kokai <ako...@berkeley.edu> wrote:

> Dear RDKit community,
>
> I'm getting unexpected results when combining SMARTS substructure
> comparisons in SQL statements, and I'd like to ask for feedback to help me
> understand what's going on.
>
> Given an element, say Au, when I make a query like this:
>
> SELECT cpds.cid FROM cpds WHERE (cpds.molecule @> '[Au]' ::qmol) AND NOT
> (cpds.molecule @> '[C,c]~[C,c]' ::qmol) AND NOT (cpds.molecule @>
> '[C!H0,c!H0]' ::qmol)
>
> I don't expect to see any compounds with C-C or C-H bonds in the results.
> Yet I get results like [(P(C5F5)3)4Au]Cl [1], or for example with Se,
> [(CH3)3Se]+ [2]. Why?
>
> It seems that usually my 'unexpected' results are matching one of the two
> "AND NOT" conditions, not both (see console output below) but I haven't
> checked systematically. I want the query to return only molecules for which
> the last two substructure conditions are both false. Is my understanding of
> SQL conjunctions mistaken?
>
> I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
> probably not using RDKit for what it was intended, but I'm certainly
> grateful that it exists and is free software. I'd very much appreciate any
> feedback on this question.
>
> Best regards,
> Akos
>
> --
>
> [1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
> [2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580
>
> Some console output regarding those compounds:
>
> In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')
>
> In [4]: mAu = Chem.MolFromSmiles('C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)
> F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F
>    ...: )F)F)F.Cl[Au]')
>
> In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[5]: False
>
> In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[6]: True
>
> In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[7]: True
>
> In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[8]: False
>
>
> Akos Kokai <http://kaios.net/>
> PhD candidate, Department of Environmental Science, Policy & Management
> <http://ourenvironment.berkeley.edu/>
> Fellow, Berkeley Center for Green Chemistry <http://bcgc.berkeley.edu/>
> University of California, Berkeley
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to