Thanks for doing this, Greg.
Fixing those SMARTS queries always looked like it would be a real...pain.
I've dropped your GitHub file into the KNIME workflow, and the RDKit
version of the workflow (using RDKit nodes 2.5.0.201505221301) now hits
770 structures in the WEHI-10k test set. But that includes 19 false
positives, i.e. structures that the SLN filters do not flag.
One filter alone is responsible for 17 of those false positives:
anil_di_alk_C(246)
old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]
An example of one of the false positive structures is the aniline
sulfonamide WEHI-18518.
I've checked with Johnathan, and the intention of that query is that
"... that the nitrogen has a single bond to a carbon that has four atoms
bonded to it (i.e. sp3), and that the other atom singly bonded to the
nitrogen atom is anything so long as it is either H or an sp3 carbon".
So sulfonamides should not match, and neither should the acetamides
(sp2 C), some of which are also showing up as hits.
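
For anyone who wants to reproduce this outside KNIME, here is a minimal
sketch using the RDKit Python API. The three SMILES are hypothetical
stand-ins I made up for illustration (they are not WEHI-18518 or any
other structure from the test set). My reading, which may be incomplete,
is that the recursive clause $([#7]-[#6;X4]) in the new pattern can be
satisfied by the same sp3 carbon that the trailing -[#6;X4] matches, so
the third substituent on the nitrogen is left unconstrained.

```python
from rdkit import Chem

# The two versions of anil_di_alk_C(246) quoted above.
OLD = "c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]"
NEW = "c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]"

# Hypothetical test molecules (assumptions for illustration only,
# not taken from the WEHI-10k set).
MOLS = {
    "dialkyl aniline": "COc1ccc(cc1)N(C)C",           # intended hit
    "sulfonamide": "COc1ccc(cc1)N(C)S(C)(=O)=O",      # should not hit
    "acetamide": "COc1ccc(cc1)N(C)C(C)=O",            # should not hit
}


def hits(smarts, smiles):
    """Return True if the SMARTS pattern matches the molecule."""
    patt = Chem.MolFromSmarts(smarts)
    mol = Chem.MolFromSmiles(smiles)
    return mol.HasSubstructMatch(patt)


# Map each name to (old pattern hit, new pattern hit).
results = {name: (hits(OLD, smi), hits(NEW, smi)) for name, smi in MOLS.items()}

for name, (old_hit, new_hit) in results.items():
    print(f"{name:16s} old={old_hit} new={new_hit}")
```

On these three stand-ins I would expect the old pattern to match only
the dialkyl aniline and the new pattern to match all three, which is
consistent with the false positives described above.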
--
Cheers,
Simon