[Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Akos Kokai
Dear RDKit community,

I'm getting unexpected results when combining SMARTS substructure
comparisons in SQL statements, and I'd like to ask for feedback to help me
understand what's going on.

Given an element, say Au, when I make a query like this:

SELECT cpds.cid FROM cpds WHERE (cpds.molecule @> '[Au]' ::qmol) AND NOT
(cpds.molecule @> '[C,c]~[C,c]' ::qmol) AND NOT (cpds.molecule @>
'[C!H0,c!H0]' ::qmol)

I don't expect to see any compounds with C-C or C-H bonds in the results.
Yet I get results like [(P(C5F5)3)4Au]Cl [1], or for example with Se,
[(CH3)3Se]+ [2]. Why?

It seems that usually my 'unexpected' results are matching one of the two
"AND NOT" conditions, not both (see console output below) but I haven't
checked systematically. I want the query to return only molecules for which
the last two substructure conditions are both false. Is my understanding of
SQL conjunctions mistaken?

I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
probably not using RDKit for what it was intended, but I'm certainly
grateful that it exists and is free software. I'd very much appreciate any
feedback on this question.

Best regards,
Akos

--

[1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
[2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580

Some console output regarding those compounds:

In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')

In [4]: mAu =
Chem.MolFromSmiles('C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F
   ...: )F)F)F.Cl[Au]')

In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
Out[5]: False

In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
Out[6]: True

In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
Out[7]: True

In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
Out[8]: False


Akos Kokai 
PhD candidate, Department of Environmental Science, Policy & Management

Fellow, Berkeley Center for Green Chemistry 
University of California, Berkeley
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Chris Earnshaw
Hi Akos

Very strange behaviour. I don't see anything wrong with your SQL syntax.
I've tried equivalent searches in my 2.6M compound database and they give
the expected results. I used iodine rather than gold, for which there are
19504 structures. Adding the qualifying SQL clauses singly and in
combination gives -
SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol;
 Ic1nc[nH]n1
 Ic1nnc[nH]1
 O=P(O)(CI)CI
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 ICI
 Cn1nnnc1I
(15 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6!H0]'::qmol;
 Nc1c(I)c(C(=O)O)c(I)c(C(=O)O)c1I
 O=C(O)c1c(I)c(I)c(I)c(I)c1C(=O)O
 Oc1nnc(I)c(O)n1
 O=c1[nH]c(Cl)c(I)c(=O)[nH]1
 O=c1[nH]nc(I)c(=O)[nH]1
 Fc1c(F)c(F)c(SC(I)=C(Sc2c(F)c(F)c(F)c(F)c2F)Sc2c(F)c(F)c(F)c(F)c2F)c(F)c1F
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 Nc1nc(N)c(I)c(O)n11
 NC(=O)c1nn[nH]c1I
(18 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol AND NOT structure @> '[#6!H0]'::qmol;
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
(10 rows)

This is with RDKit 2016-09 and postgresql-9.5. The returned hits are all
consistent with the additional qualifiers. BTW, the presence or absence of
parentheses around the clauses makes no difference, nor does replacing
[C,c] by [#6].

If the queries are running correctly (and not knowing the structure of your
database), is there any possibility of ambiguity in the cid values in the
cpds table? Do they have a primary key / uniqueness constraint? If any cids
are associated with multiple structures then the structure searching could
be working fine, but you might sometimes get spurious structures returned
from the cid lookup. You can check this with the query -
SELECT cid, COUNT(cid) FROM cpds GROUP BY cid HAVING COUNT(cid) > 1;
This will return any cid values which occur more than once in the cpds
table.

Failing that, the only time I've seen behaviour resembling your report was
with a database (using a different cartridge, in Oracle) where the
structure index had become corrupted. I don't know if that's even a
possibility with RDKit, but it might be worth considering an index rebuild
if everything else looks OK.

Best regards,
Chris


On 21 March 2017 at 06:34, Akos Kokai  wrote:

> Dear RDKit community,
>
> I'm getting unexpected results when combining SMARTS substructure
> comparisons in SQL statements, and I'd like to ask for feedback to help me
> understand what's going on.
>
> Given an element, say Au, when I make a query like this:
>
> SELECT cpds.cid FROM cpds WHERE (cpds.molecule @> '[Au]' ::qmol) AND NOT
> (cpds.molecule @> '[C,c]~[C,c]' ::qmol) AND NOT (cpds.molecule @>
> '[C!H0,c!H0]' ::qmol)
>
> I don't expect to see any compounds with C-C or C-H bonds in the results.
> Yet I get results like [(P(C5F5)3)4Au]Cl [1], or for example with Se,
> [(CH3)3Se]+ [2]. Why?
>
> It seems that usually my 'unexpected' results are matching one of the two
> "AND NOT" conditions, not both (see console output below) but I haven't
> checked systematically. I want the query to return only molecules for which
> the last two substructure conditions are both false. Is my understanding of
> SQL conjunctions mistaken?
>
> I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
> probably not using RDKit for what it was intended, but I'm certainly
> grateful that it exists and is free software. I'd very much appreciate any
> feedback on this question.
>
> Best regards,
> Akos
>
> --
>
> [1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
> [2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580
>
> Some console output regarding those compounds:
>
> In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')
>
> In [4]: mAu = Chem.MolFromSmiles('C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)
> F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F
>...: )F)F)F.Cl[Au]')
>
> In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[5]: False
>
> In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[6]: True
>
> In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[7]: True
>
> In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[8]: False
>
>
> Akos Kokai 
> PhD candidate, Department of Environmental Science, Policy & Management
> 
> Fellow, Berkeley Center for Green Chemistry 
> University of California, Berkeley
>
> 
> --
> Check out the vibrant tech community on one of the 

Re: [Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Greg Landrum
I agree with Chris' later comment that this doesn't look right.

Here's a simple test you can do in order to see if the right thing is
happening:

chembl_21=# select * from (select
'C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F)F)F)F.Cl[Au]'::mol
as mol) tmp where mol@>'[Au]'::qmol and NOT (mol@>'[C,c]~[C,c]'::qmol) and
NOT (mol @> '[C!H0,c!H0]'::qmol);

 mol

-

(0 rows)

-greg



On Tue, Mar 21, 2017 at 6:34 AM, Akos Kokai  wrote:

> Dear RDKit community,
>
> I'm getting unexpected results when combining SMARTS substructure
> comparisons in SQL statements, and I'd like to ask for feedback to help me
> understand what's going on.
>
> Given an element, say Au, when I make a query like this:
>
> SELECT cpds.cid FROM cpds WHERE (cpds.molecule @> '[Au]' ::qmol) AND NOT
> (cpds.molecule @> '[C,c]~[C,c]' ::qmol) AND NOT (cpds.molecule @>
> '[C!H0,c!H0]' ::qmol)
>
> I don't expect to see any compounds with C-C or C-H bonds in the results.
> Yet I get results like [(P(C5F5)3)4Au]Cl [1], or for example with Se,
> [(CH3)3Se]+ [2]. Why?
>
> It seems that usually my 'unexpected' results are matching one of the two
> "AND NOT" conditions, not both (see console output below) but I haven't
> checked systematically. I want the query to return only molecules for which
> the last two substructure conditions are both false. Is my understanding of
> SQL conjunctions mistaken?
>
> I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
> probably not using RDKit for what it was intended, but I'm certainly
> grateful that it exists and is free software. I'd very much appreciate any
> feedback on this question.
>
> Best regards,
> Akos
>
> --
>
> [1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
> [2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580
>
> Some console output regarding those compounds:
>
> In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')
>
> In [4]: mAu = Chem.MolFromSmiles('C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)
> F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F
>...: )F)F)F.Cl[Au]')
>
> In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[5]: False
>
> In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[6]: True
>
> In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[7]: True
>
> In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[8]: False
>
>
> Akos Kokai 
> PhD candidate, Department of Environmental Science, Policy & Management
> 
> Fellow, Berkeley Center for Green Chemistry 
> University of California, Berkeley
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Akos Kokai
Hi Chris and Greg,

Thank you for helping me identify possible problems.

- The query that Chris suggested to identify count(cid) > 1 returned 0
rows. I was kind of expecting this to be the glaringly obvious problem, but
maybe it's more subtle.
- Greg's test also returned 0 rows. That is reassuring.

The likely culprit is the database itself, which I put together using IDs
from US EPA's CompTox Dashboard. I did some checks on identifier ambiguity
while doing it, but it was also the first time I had ever used any form of
RDBMS. I did not, for example, use any constraints (!). I will re-examine
or redo that when I get a chance (and come back to you if it ends up still
looking like an RDKit-related problem after all).

Once again, thank you for your help.

Yours,
Akos

Akos Kokai 
PhD candidate, Department of Environmental Science, Policy & Management

Fellow, Berkeley Center for Green Chemistry 
University of California, Berkeley
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss