Interesting timing! I have just come across this exact same issue when
experimenting with new SMARTS-based standardisations (so, yes, treating
unnatural molecules in unwholesome ways).
I 'fixed' it by calling SanitizeMol twice.
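For anyone wanting to try the same workaround, here is a minimal sketch of the "sanitize twice" idea. The helper name passes_repeated_sanitization and the flaky_sanitize stand-in are hypothetical and not part of RDKit. The sanitizer is passed in as a callable so the sketch runs without RDKit installed. The only assumption about the sanitizer is that it modifies the molecule in place and raises ValueError on failure, as in the traceback quoted further down this thread.

```python
from typing import Callable, TypeVar

M = TypeVar("M")


def passes_repeated_sanitization(
    mol: M, sanitize: Callable[[M], object], passes: int = 2
) -> bool:
    """Return True only if sanitize(mol) succeeds `passes` times in a row.

    Assumption: `sanitize` works in place and raises ValueError on failure.
    """
    for _ in range(passes):
        try:
            sanitize(mol)
        except ValueError:
            return False
    return True


# Hypothetical stand-in for the behaviour discussed in this thread: the
# first call succeeds but leaves the object in a state the next call rejects.
def flaky_sanitize(state: dict) -> None:
    if state.get("sanitized"):
        raise ValueError("Explicit valence for atom # 1 N, 4, is greater than permitted")
    state["sanitized"] = True


print(passes_repeated_sanitization({}, flaky_sanitize, passes=1))  # True
print(passes_repeated_sanitization({}, flaky_sanitize, passes=2))  # False
```

With RDKit, one would presumably pass Chem.SanitizeMol as the sanitize argument and skip any product for which the helper returns False.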
On 31/10/2018 14:48, Ivan Tubert-Brohman wrote:
Hi Greg,
Thanks for the detailed explanation. You are right that this is not a
real molecule; it came from applying a user-supplied reaction SMARTS.
(The reaction SMARTS was not the best-written perhaps, but that's
tangential...). I normally sanitize the products and skip those that
fail the sanitization, but in this case I was surprised when the
sanitized molecule caused issues later while trying to compute
descriptors.
I look forward to a fix, but in the meantime maybe I'll consider
running SanitizeMol twice. :-)
Best,
Ivan
On Wed, Oct 31, 2018 at 2:41 AM Greg Landrum <greg.land...@gmail.com> wrote:
Hi Ivan,
Short answer: I would not normally expect a second sanitization to
fail if the first succeeds, but your input SMILES is very odd and
triggers a bug.
This is an interesting edge case for the sanitization code because
it includes a weird mix of aromatic and aliphatic atoms and bonds.
I do hope this came out of some computational process and isn't a
"real" molecule. You almost couldn't have picked a better example
to highlight the situation that's causing the problem here. Some
form of congratulations is in order. :-)
Here's an explanation of what's going on with your molecule
C1=n(C)-c=Cn1
The fundamental problem is that atom 1 (the first nitrogen) has a
valence of 4 and is neutral...
If you wrote the SMILES as C1=N(C)C=CN1, which is what the
sanitization process produces, I don't think you'd be surprised
that the RDKit sanitization fails (and your second call to
sanitize does fail).
To understand why it passes the first time, you need to understand
the flow of the sanitization process, described here:
https://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization
Step 3, updatePropertyCache(), is the part that reports valency
errors. There's a special case in this code for aromatic atoms
that allows atoms like the N in Cn1cccc1 to pass sanitization even
though they are formally four-valent (2x1.5 for the aromatic
bonds + 1 for the C). Your molecule is triggering that special case
because atom 1 is aromatic in the input SMILES. Incorrect aromatic
rings that get through this step normally end up getting caught
later when the molecule is kekulized (step 5). In your case there
are no aromatic bonds to kekulize, so no error is thrown. The
aromaticity perception (step 6) does not consider the ring to be
aromatic, so the final molecule is the equivalent of C1=N(C)C=CN1.
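The arithmetic behind this explanation can be sketched with a toy model. This is not RDKit's implementation: ToyAtom, toy_valence_ok and the "aromatic atoms always pass" rule are simplifying assumptions made purely for illustration. The bond orders are read off the SMILES strings above.

```python
from dataclasses import dataclass
from typing import List

MAX_NEUTRAL_N_VALENCE = 3  # usual limit for an uncharged nitrogen


@dataclass
class ToyAtom:
    symbol: str
    bond_orders: List[float]
    is_aromatic: bool

    def valence(self) -> float:
        return sum(self.bond_orders)


def toy_valence_ok(atom: ToyAtom) -> bool:
    """Toy valence check with a lenient path for aromatic-flagged atoms."""
    if atom.is_aromatic:
        # Simplification: trust the flag and leave it to a later
        # kekulization step to reject bad aromatic systems.
        return True
    return atom.valence() <= MAX_NEUTRAL_N_VALENCE


# N in Cn1cccc1: two aromatic bonds (1.5 each) plus the methyl single bond.
pyrrole_n = ToyAtom("N", [1.5, 1.5, 1.0], is_aromatic=True)

# Atom 1 of C1=n(C)-c=Cn1 on the first pass: flagged aromatic by the input
# SMILES, but its bonds are an explicit double and two singles.
odd_n_first_pass = ToyAtom("N", [2.0, 1.0, 1.0], is_aromatic=True)

# Same atom on the second pass: the aromatic flag has been cleared,
# so it is just the N in C1=N(C)C=CN1.
odd_n_second_pass = ToyAtom("N", [2.0, 1.0, 1.0], is_aromatic=False)

print(pyrrole_n.valence(), toy_valence_ok(pyrrole_n))                  # 4.0 True
print(odd_n_first_pass.valence(), toy_valence_ok(odd_n_first_pass))    # 4.0 True
print(odd_n_second_pass.valence(), toy_valence_ok(odd_n_second_pass))  # 4.0 False
```

All three atoms have a formal valence of 4. In this toy model only the aromatic flag decides whether the check passes, which is why the same atom can pass on the first call and fail on the second.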
It ought to be possible to clear this up in the sanitization code
relatively easily; I just need to think about it a bit and do a
bunch of testing.
-greg
On Tue, Oct 30, 2018 at 10:02 PM Ivan Tubert-Brohman
<ivan.tubert-broh...@schrodinger.com> wrote:
Hi,
I was surprised to see that a (dubious) structure that goes
through SanitizeMol OK can fail a subsequent sanitization call:
from rdkit import Chem

print("Start")
# Parse without sanitizing so that SanitizeMol can be called explicitly.
mol = Chem.MolFromSmiles('C1=n(C)-c=Cn1', sanitize=False)
print("Before first sanitization")
Chem.SanitizeMol(mol)
print("Before second sanitization")
Chem.SanitizeMol(mol)
print("Done")
The output is:
Start
Before first sanitization
Before second sanitization
[16:54:20] Explicit valence for atom # 1 N, 4, is greater than permitted
Traceback (most recent call last):
  File "./san.py", line 9, in <module>
    Chem.SanitizeMol(mol)
ValueError: Sanitization error: Explicit valence for atom # 1 N, 4, is greater than permitted
Is this an unavoidable aspect of the way SanitizeMol works,
since it does several operations (Kekulize, check valencies,
set aromaticity, conjugation and hybridization) in a certain
order, or should this be considered a bug?
Best,
Ivan
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
--
Dr Francis L Atkinson
Chemogenomics Group
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
(01223) 494473