Re: [Rdkit-discuss] confused about explicit hydrogens and canonicalization

Andrew Dalke Wed, 03 Feb 2016 18:46:21 -0800

On Feb 3, 2016, at 6:42 AM, Greg Landrum wrote:
> 2) If you add a call to Chem.SanitizeMol(hydrogren_mol) before any of the 
> calls to SMILES generation, it clears up the problem. The calls to 
> SetNumExplicitHs() are not necessary.


I am able to fix my problem by adding a SANITIZE_ADJUSTHS :

def cut_with_hydrogens(mol, cuts):
    emol = Chem.RWMol(mol)
    for core_atom, rgroup_atom in cuts:
        emol.RemoveBond(core_atom, rgroup_atom)
        for atom in (core_atom, rgroup_atom):
            atom_obj = emol.GetAtomWithIdx(atom)
            atom_obj.SetNumExplicitHs(atom_obj.GetNumExplicitHs()+1)
            
    mol = emol.GetMol()
    # Note: perhaps full sanitization would be better here.
    rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
    return mol


>>> mol = Chem.MolFromSmiles("P(C)(C)(C)(C)C")
>>> mol2 = cut_with_hydrogens(mol, [(0,1),(0,2),(0,3),(0,4),(0,5)])
>>> Chem.MolToSmiles(mol2)
'C.C.C.C.C.[PH5]'

With this extra call, my two approaches (sytax-based editing and graph-based 
editing) now give the same answers in nearly all test cases, without needing to 
re-canonicalize the SMILES for the latter.


There are two exceptions, both too complicated for me to understand, though I 
tried for a couple of hours.

I put them here, and my analysis, for curiosity's sake. They are:

C(C1C2C(C(C(O1)OC3C(OC(C(C3O)O)OC4C(OC(C(C4O)O)OC5C(OC(C(C5O)O)OC6C(OC(C(C6O)O)OC7C(OC(C(C7O)O)OC8C(OC(O2)C(C8O)O)CO)CO)CO)CO)CO)CO)O)O)N9C12C93c4c5c6c7c8c9c%10c%11c%12c%13c%14c%10c%10c8c6c6c4c4c1c1c8c%15c(c%13c%13c%12c%12c%16c%11c9c9c7c7c5c3c3c5c7c9c%16c7c%12c9c%13c%15c%11c9c(c75)c3c2c%111)c1c%14c%10c6c4c18

CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C

I'll use the second one because it's shorter:

I want to cut between atoms [(6, 5), (8, 64), (11, 12)]. (Note: these were 
found with the SMARTS "[R]-[!R;!#1]".)

Here's a guide:

   0                  1               2              3                  4
   01 2  3   4 56 78 901 2 34 5 67 89 012 34 56 7 8 90 123 45 6 7 8   9 0
   CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)
        cut 1 -^^  ^   ^-^-- cut 2
                    \-----------------------------------------------------

   4                   5                      6                      7
   1  2 3 4 5 6 7 8 9  0 1 2  3  4 5 6 7  8 9 0 1  2 3 4 5  6 7  8 9 0 1 2 3
   C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C
   ----------------------------------------------------^ cut 3

There's a neat technique I figured out this evening to transform the SMILES 
string automatically for non-ring single bonds that I want to cut and turn into 
a pair of hydrogens. Find the latter of the two atoms in the SMILES string, and 
prefix it with "[H].[H]". For example, "Cc1ccccc1" turns into 
"C[H].[H]c1ccccc1".

Then with a bit of regular expression matching, I can transform the above 
SMILES into the "[H].[H]" form:

============
import re

atom_pattern = re.compile(r"""
(
 Cl? |
 Br? |
 [NOSPFIbcnosp] |
 \[[^]]*\]
)
""", re.X)

def cut_with_syntax(smiles, cuts):
    # Find all of the atom symbols in the SMILES
    atom_matches = list(atom_pattern.finditer(smiles))
    #print([m.group() for m in atom_matches])
    
    # Find the latter of the two atom indices in a cut    
    atoms_to_modify = set(max(cut) for cut in cuts)
    
    # Assemble the new SMILES string from the old
    prev = 0
    terms = []
    for atom, atom_match in enumerate(atom_matches):
        end = atom_match.end()
        if atom in atoms_to_modify:
            # add the prefix for the right-most atoms
            terms.append(smiles[prev:atom_match.start()])
            terms.append("[H].[H]")
            terms.append(atom_match.group())
        else:
            terms.append(smiles[prev:end])
        prev = end
    terms.append(smiles[prev:]) # Include the last bit

    return "".join(terms)  # This is the new SMILES
============

I'll transform the SMILES using this syntax manipulation:

>>> smiles = 
>>> "CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C"
>>> cuts = [(6, 5), (8, 64), (11, 12)]
>>> syntax_smiles = cut_with_syntax(smiles, cuts)
>>> print(syntax_smiles)
CC(C)[Si](C#C[H].[H]c1cc(ccc1[H].[H]C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)[H].[H]C(C)(C)C)(C(C)C)C(C)C

I've annotated the new SMILES using the same numbering system (i.e., left the 
hydrogens unannotated).

   0                         1                      2              3            
      4
   01 2  3   4 5       6 78 901        2 34 5 67 89 012 34 56 7 8 90 123 45 6 7 
8   9 0
   
CC(C)[Si](C#C[H].[H]c1cc(ccc1[H].[H]C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)
        cut 1 -^^^^^^^^   ^   ^-^^^^^^^^-- cut 2
                           
\------------------------------------------------------------

   4                   5                      6                             7
   1  2 3 4 5 6 7 8 9  0 1 2  3  4 5 6 7  8 9 0 1  2 3        4 5  6 7  8 9 0 1 
2 3
   
C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)[H].[H]C(C)(C)C)(C(C)C)C(C)C
   ----------------------------------------------------^^^^^^^^ cut 3

As far as I can tell, I've violated no chemistry or done anything strange.

The following verifies that none of the cut bonds were ring bonds, and gives an 
idea of the fragments:

>>> syntax_mol = Chem.MolFromSmiles(syntax_smiles)
>>> [submol.GetNumAtoms() for submol in rdmolops.GetMolFrags(syntax_mol, True)]
[12, 6, 52, 4]

Therefore, I expect that the graph-based version will generate the following 
SMILES:

>>> expected_smiles = Chem.MolToSmiles(syntax_mol)
>>> print(expected_smiles)
C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1


The graph edit code I used is:

def cut_with_hydrogens(mol, cuts):
    emol = Chem.RWMol(mol)
    for core_atom, rgroup_atom in cuts:
        emol.RemoveBond(core_atom, rgroup_atom)
        for atom in (core_atom, rgroup_atom):
            atom_obj = emol.GetAtomWithIdx(atom)
            atom_obj.SetNumExplicitHs(atom_obj.GetNumExplicitHs()+1)
            
    mol = emol.GetMol()
    rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
    return mol

If I apply it to the same input SMILES, I get a structure with the same 
fragment size composition:

>>> input_mol = Chem.MolFromSmiles(smiles)
>>> cut_mol = cut_with_hydrogens(input_mol, cuts)
>>> [submol.GetNumAtoms() for submol in rdmolops.GetMolFrags(cut_mol, True)]
[12, 6, 52, 4]

But when I turn that into a SMILES, it does't match what I expected to match:

>>> cut_smiles = Chem.MolToSmiles(cut_mol)
>>> print(cut_smiles)
C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> cut_smiles == expected_smiles
False


For the life of me, I can't figure out why they are different!


The core structure going through explicit graph editing (using 
SetNumExplicitHs()) is:

C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C

Here's the depiction using PubChem's CACTVS entry tool:


The core structure using syntax manipulation of the SMILES string (inserting 
"[H].[H]") is:

C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C




You can see that there is a ring perception difference between the two, and the 
first has a "C#C-H" that isn't present in the second. But that doesn't tell me 
what RDKit is doing.


For fun, I also replaced the 

    rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
with
    rdmolops.SanitizeMol(mol)

This gives me a third SMILES string!

>>> cut_mol2 = cut_with_hydrogens2(input_mol, cuts)
>>> cut_smiles2 = Chem.MolToSmiles(cut_mol2)
>>> cut_smiles2 == cut_smiles
False
>>> cut_smiles2 == expected_smiles
False

(To repeat the above strings, for reference:

>>> print(cut_smiles) # SetNumExplicitHs() + SANITIZE_ADJUSTHS
C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1

>>> print(cut_smiles2)  # SetNumExplicitHs()  + using full sanitization
C#CC1=C(C#Cc2ccc(C(C)(C)C)cc2C#C[Si](C(C)C)(C(C)C)C(C)C)C2=CC(C(C)(C)C)=CC3=CC=C4C=C(C(C)(C)C)C=C1C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1

>>> print(expected_smiles)  # using '[H].[H]' insertions in the SMILE
C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1

)

And yet another solution - add real hydrogen atoms to the molecular graph then 
call RemoveHs:

def cut_with_explicit_hydrogens(mol, cuts):
    emol = Chem.RWMol(mol)
    for core_atom, rgroup_atom in cuts:
        emol.RemoveBond(core_atom, rgroup_atom)
        for atom in (core_atom, rgroup_atom):
            new_atom = emol.AddAtom(Chem.Atom(1))
            emol.AddBond(atom, new_atom, Chem.BondType.SINGLE)
            
    return Chem.RemoveHs(emil)

This one gives me the same solution as "SetNumExplicitHs()  + using full 
sanitization".

>>> h_mol = cut_with_explicit_hydrogens(input_mol, cuts)
>>> h_smiles = Chem.MolToSmiles(hmol)
>>> print(h_smiles)
C#CC1=C(C#Cc2ccc(C(C)(C)C)cc2C#C[Si](C(C)C)(C(C)C)C(C)C)C2=CC(C(C)(C)C)=CC3=CC=C4C=C(C(C)(C)C)C=C1C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> cut_smiles2 == h_smiles
True


This suggests that Chem.SanitizeFlags.SANITIZE_ADJUSTHS isn't enough for my 
sanitization needs, so for consistency's sake I should probably do full 
sanitization until I figure out what's going on.

Not that it will help in this case, but it might make me feel better.

Cheers,


                                Andrew
                                [email protected]

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] confused about explicit hydrogens and canonicalization

Reply via email to