Re: [Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Akos Kokai
Hi Chris and Greg,

Thank you for helping me identify possible problems.

- The query that Chris suggested to identify count(cid) > 1 returned 0
rows. I was kind of expecting this to be the glaringly obvious problem, but
maybe it's more subtle.
- Greg's test also returned 0 rows. That is reassuring.
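
A duplicate-ID check of this kind can be sketched as below. This is only an illustration under stated assumptions, not the exact query Chris suggested: it uses Python's built-in sqlite3 in place of PostgreSQL, stores molecules as plain text instead of RDKit mol values, and the rows are made up. Only the `cpds` table and `cid` column names come from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the cpds table; molecule is plain text here,
# not an RDKit mol column.
conn.execute("CREATE TABLE cpds (cid INTEGER, molecule TEXT)")
conn.executemany(
    "INSERT INTO cpds VALUES (?, ?)",
    [(1, "C[Se+](C)C"), (2, "Cl[Au]"), (2, "CC")],  # cid 2 appears twice
)

# List every cid that occurs in more than one row, with its row count.
duplicate_cids = conn.execute(
    "SELECT cid, count(cid) FROM cpds GROUP BY cid HAVING count(cid) > 1"
).fetchall()
print(duplicate_cids)
```

An empty result, as reported above, means no cid is attached to more than one row.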

The likely culprit is the database itself, which I put together using IDs
from US EPA's CompTox Dashboard. I did some checks on identifier ambiguity
while doing it, but it was also the first time I had ever used any form of
RDBMS. I did not, for example, use any constraints (!). I will re-examine
or redo that when I get a chance (and come back to you if it ends up still
looking like an RDKit-related problem after all).
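
As a rough sketch of what constraints would add, the example below declares a primary key so that a second row with the same ID is rejected at insert time. It assumes a simplified, hypothetical schema and uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the real table would need an RDKit mol column and whatever identifier columns the CompTox data requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: cid is the primary key and molecule must be present.
conn.execute(
    "CREATE TABLE cpds (cid INTEGER PRIMARY KEY, molecule TEXT NOT NULL)"
)
conn.execute("INSERT INTO cpds VALUES (?, ?)", (1, "C[Se+](C)C"))

try:
    # A second row with the same cid violates the primary key.
    conn.execute("INSERT INTO cpds VALUES (?, ?)", (1, "CC"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT count(*) FROM cpds").fetchone()[0]
print(duplicate_rejected, row_count)
```

With the constraint in place, an ambiguous identifier surfaces as an error while loading the data instead of as a surprising query result later.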

Once again, thank you for your help.

Yours,
Akos

Akos Kokai <http://kaios.net/>
PhD candidate, Department of Environmental Science, Policy & Management
<http://ourenvironment.berkeley.edu/>
Fellow, Berkeley Center for Green Chemistry <http://bcgc.berkeley.edu/>
University of California, Berkeley
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Akos Kokai
Dear RDKit community,

I'm getting unexpected results when combining SMARTS substructure
comparisons in SQL statements, and I'd like to ask for feedback to help me
understand what's going on.

Given an element, say Au, when I make a query like this:

SELECT cpds.cid
FROM cpds
WHERE (cpds.molecule @> '[Au]'::qmol)
  AND NOT (cpds.molecule @> '[C,c]~[C,c]'::qmol)
  AND NOT (cpds.molecule @> '[C!H0,c!H0]'::qmol)

I don't expect to see any compounds with C-C or C-H bonds in the results.
Yet I get results like [(P(C6F5)3)4Au]Cl [1], or for example with Se,
[(CH3)3Se]+ [2]. Why?

It seems that my 'unexpected' results usually match one of the two
"AND NOT" patterns, not both (see console output below), but I haven't
checked systematically. I want the query to return only molecules for which
the last two substructure conditions are both false. Is my understanding of
SQL conjunctions mistaken?

I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
probably not using RDKit for what it was intended, but I'm certainly
grateful that it exists and is free software. I'd very much appreciate any
feedback on this question.

Best regards,
Akos

--

[1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
[2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580

Some console output regarding those compounds:

In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')

In [4]: mAu = Chem.MolFromSmiles(
   ...:     'C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)F)F)F)F)'
   ...:     'C3=C(C(=C(C(=C3F)F)F)F)F)F)F)F.Cl[Au]')

In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
Out[5]: False

In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
Out[6]: True

In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
Out[7]: True

In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
Out[8]: False
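
To check the SQL logic in isolation, the match results above can be fed into the same AND NOT pattern. This is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 in place of PostgreSQL, and the hypothetical boolean columns `has_elem`, `has_cc` and `has_ch` stand in for the three `@>` substructure tests. The third row is made up for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: boolean columns replace the three substructure tests.
conn.execute(
    "CREATE TABLE cpds (cid TEXT, has_elem INTEGER, has_cc INTEGER, has_ch INTEGER)"
)
conn.executemany(
    "INSERT INTO cpds VALUES (?, ?, ?, ?)",
    [
        ("Se", 1, 0, 1),       # mSe: Out[5] False, Out[7] True
        ("Au", 1, 1, 0),       # mAu: Out[6] True, Out[8] False
        ("neither", 1, 0, 0),  # made-up row matching neither carbon pattern
    ],
)

# Same shape as the original query: A AND NOT B AND NOT C.
rows = conn.execute(
    "SELECT cid FROM cpds WHERE has_elem AND NOT has_cc AND NOT has_ch"
).fetchall()
print(rows)
```

Under these assumptions only the row that matches neither carbon pattern is returned, so a conjunction of this form would exclude both example compounds as long as the stored molecules match the ones tested in the console.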


Akos Kokai <http://kaios.net/>
PhD candidate, Department of Environmental Science, Policy & Management
<http://ourenvironment.berkeley.edu/>
Fellow, Berkeley Center for Green Chemistry <http://bcgc.berkeley.edu/>
University of California, Berkeley