On Feb 3, 2016, at 6:42 AM, Greg Landrum wrote:
> 2) If you add a call to Chem.SanitizeMol(hydrogren_mol) before any of the
> calls to SMILES generation, it clears up the problem. The calls to
> SetNumExplicitHs() are not necessary.
I am able to fix my problem by adding a SANITIZE_ADJUSTHS :
def cut_with_hydrogens(mol, cuts):
emol = Chem.RWMol(mol)
for core_atom, rgroup_atom in cuts:
emol.RemoveBond(core_atom, rgroup_atom)
for atom in (core_atom, rgroup_atom):
atom_obj = emol.GetAtomWithIdx(atom)
atom_obj.SetNumExplicitHs(atom_obj.GetNumExplicitHs()+1)
mol = emol.GetMol()
# Note: perhaps full sanitization would be better here.
rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
return mol
>>> mol = Chem.MolFromSmiles("P(C)(C)(C)(C)C")
>>> mol2 = cut_with_hydrogens(mol, [(0,1),(0,2),(0,3),(0,4),(0,5)])
>>> Chem.MolToSmiles(mol2)
'C.C.C.C.C.[PH5]'
With this extra call, my two approaches (sytax-based editing and graph-based
editing) now give the same answers in nearly all test cases, without needing to
re-canonicalize the SMILES for the latter.
There are two exceptions, both too complicated for me to understand, though I
tried for a couple of hours.
I put them here, and my analysis, for curiosity's sake. They are:
C(C1C2C(C(C(O1)OC3C(OC(C(C3O)O)OC4C(OC(C(C4O)O)OC5C(OC(C(C5O)O)OC6C(OC(C(C6O)O)OC7C(OC(C(C7O)O)OC8C(OC(O2)C(C8O)O)CO)CO)CO)CO)CO)CO)O)O)N9C12C93c4c5c6c7c8c9c%10c%11c%12c%13c%14c%10c%10c8c6c6c4c4c1c1c8c%15c(c%13c%13c%12c%12c%16c%11c9c9c7c7c5c3c3c5c7c9c%16c7c%12c9c%13c%15c%11c9c(c75)c3c2c%111)c1c%14c%10c6c4c18
CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C
I'll use the second one because it's shorter:
I want to cut between atoms [(6, 5), (8, 64), (11, 12)]. (Note: these were
found with the SMARTS "[R]-[!R;!#1]".)
Here's a guide:
0 1 2 3 4
01 2 3 4 56 78 901 2 34 5 67 89 012 34 56 7 8 90 123 45 6 7 8 9 0
CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)
cut 1 -^^ ^ ^-^-- cut 2
\-----------------------------------------------------
4 5 6 7
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C
----------------------------------------------------^ cut 3
There's a neat technique I figured out this evening to transform the SMILES
string automatically for non-ring single bonds that I want to cut and turn into
a pair of hydrogens. Find the latter of the two atoms in the SMILES string, and
prefix it with "[H].[H]". For example, "Cc1ccccc1" turns into
"C[H].[H]c1ccccc1".
Then with a bit of regular expression matching, I can transform the above
SMILES into the "[H].[H]" form:
============
import re
atom_pattern = re.compile(r"""
(
Cl? |
Br? |
[NOSPFIbcnosp] |
\[[^]]*\]
)
""", re.X)
def cut_with_syntax(smiles, cuts):
# Find all of the atom symbols in the SMILES
atom_matches = list(atom_pattern.finditer(smiles))
#print([m.group() for m in atom_matches])
# Find the latter of the two atom indices in a cut
atoms_to_modify = set(max(cut) for cut in cuts)
# Assemble the new SMILES string from the old
prev = 0
terms = []
for atom, atom_match in enumerate(atom_matches):
end = atom_match.end()
if atom in atoms_to_modify:
# add the prefix for the right-most atoms
terms.append(smiles[prev:atom_match.start()])
terms.append("[H].[H]")
terms.append(atom_match.group())
else:
terms.append(smiles[prev:end])
prev = end
terms.append(smiles[prev:]) # Include the last bit
return "".join(terms) # This is the new SMILES
============
I'll transform the SMILES using this syntax manipulation:
>>> smiles =
>>> "CC(C)[Si](C#Cc1cc(ccc1C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)C(C)(C)C)(C(C)C)C(C)C"
>>> cuts = [(6, 5), (8, 64), (11, 12)]
>>> syntax_smiles = cut_with_syntax(smiles, cuts)
>>> print(syntax_smiles)
CC(C)[Si](C#C[H].[H]c1cc(ccc1[H].[H]C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)[H].[H]C(C)(C)C)(C(C)C)C(C)C
I've annotated the new SMILES using the same numbering system (i.e., left the
hydrogens unannotated).
0 1 2 3
4
01 2 3 4 5 6 78 901 2 34 5 67 89 012 34 56 7 8 90 123 45 6 7
8 9 0
CC(C)[Si](C#C[H].[H]c1cc(ccc1[H].[H]C#Cc2c3cc(cc4ccc5cc(cc(c2C#Cc6ccc(cc6C#C[Si](C(C)
cut 1 -^^^^^^^^ ^ ^-^^^^^^^^-- cut 2
\------------------------------------------------------------
4 5 6 7
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
2 3
C)(C(C)C)C(C)C)C(C)(C)C)C5(C43C)C)C(C)(C)C)C(C)(C)C)[H].[H]C(C)(C)C)(C(C)C)C(C)C
----------------------------------------------------^^^^^^^^ cut 3
As far as I can tell, I've violated no chemistry or done anything strange.
The following verifies that none of the cut bonds were ring bonds, and gives an
idea of the fragments:
>>> syntax_mol = Chem.MolFromSmiles(syntax_smiles)
>>> [submol.GetNumAtoms() for submol in rdmolops.GetMolFrags(syntax_mol, True)]
[12, 6, 52, 4]
Therefore, I expect that the graph-based version will generate the following
SMILES:
>>> expected_smiles = Chem.MolToSmiles(syntax_mol)
>>> print(expected_smiles)
C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
The graph edit code I used is:
def cut_with_hydrogens(mol, cuts):
emol = Chem.RWMol(mol)
for core_atom, rgroup_atom in cuts:
emol.RemoveBond(core_atom, rgroup_atom)
for atom in (core_atom, rgroup_atom):
atom_obj = emol.GetAtomWithIdx(atom)
atom_obj.SetNumExplicitHs(atom_obj.GetNumExplicitHs()+1)
mol = emol.GetMol()
rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
return mol
If I apply it to the same input SMILES, I get a structure with the same
fragment size composition:
>>> input_mol = Chem.MolFromSmiles(smiles)
>>> cut_mol = cut_with_hydrogens(input_mol, cuts)
>>> [submol.GetNumAtoms() for submol in rdmolops.GetMolFrags(cut_mol, True)]
[12, 6, 52, 4]
But when I turn that into a SMILES, it does't match what I expected to match:
>>> cut_smiles = Chem.MolToSmiles(cut_mol)
>>> print(cut_smiles)
C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> cut_smiles == expected_smiles
False
For the life of me, I can't figure out why they are different!
The core structure going through explicit graph editing (using
SetNumExplicitHs()) is:
C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C
Here's the depiction using PubChem's CACTVS entry tool:
The core structure using syntax manipulation of the SMILES string (inserting
"[H].[H]") is:
C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C
You can see that there is a ring perception difference between the two, and the
first has a "C#C-H" that isn't present in the second. But that doesn't tell me
what RDKit is doing.
For fun, I also replaced the
rdmolops.SanitizeMol(mol, Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
with
rdmolops.SanitizeMol(mol)
This gives me a third SMILES string!
>>> cut_mol2 = cut_with_hydrogens2(input_mol, cuts)
>>> cut_smiles2 = Chem.MolToSmiles(cut_mol2)
>>> cut_smiles2 == cut_smiles
False
>>> cut_smiles2 == expected_smiles
False
(To repeat the above strings, for reference:
>>> print(cut_smiles) # SetNumExplicitHs() + SANITIZE_ADJUSTHS
C#CC1C2=CC(=CC3=CC=C4C=C(C=C(C=1C#Cc1ccc(cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C(C)(C)C)C4(C)C32C)C(C)(C)C)C(C)(C)C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> print(cut_smiles2) # SetNumExplicitHs() + using full sanitization
C#CC1=C(C#Cc2ccc(C(C)(C)C)cc2C#C[Si](C(C)C)(C(C)C)C(C)C)C2=CC(C(C)(C)C)=CC3=CC=C4C=C(C(C)(C)C)C=C1C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> print(expected_smiles) # using '[H].[H]' insertions in the SMILE
C#CC1=C2C=C(C(C)(C)C)C=C3C=CC4=CC(C(C)(C)C)=CC(=C1C#Cc1ccc(C(C)(C)C)cc1C#C[Si](C(C)C)(C(C)C)C(C)C)C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
)
And yet another solution - add real hydrogen atoms to the molecular graph then
call RemoveHs:
def cut_with_explicit_hydrogens(mol, cuts):
emol = Chem.RWMol(mol)
for core_atom, rgroup_atom in cuts:
emol.RemoveBond(core_atom, rgroup_atom)
for atom in (core_atom, rgroup_atom):
new_atom = emol.AddAtom(Chem.Atom(1))
emol.AddBond(atom, new_atom, Chem.BondType.SINGLE)
return Chem.RemoveHs(emil)
This one gives me the same solution as "SetNumExplicitHs() + using full
sanitization".
>>> h_mol = cut_with_explicit_hydrogens(input_mol, cuts)
>>> h_smiles = Chem.MolToSmiles(hmol)
>>> print(h_smiles)
C#CC1=C(C#Cc2ccc(C(C)(C)C)cc2C#C[Si](C(C)C)(C(C)C)C(C)C)C2=CC(C(C)(C)C)=CC3=CC=C4C=C(C(C)(C)C)C=C1C4(C)C32C.C#C[Si](C(C)C)(C(C)C)C(C)C.CC(C)C.c1ccccc1
>>> cut_smiles2 == h_smiles
True
This suggests that Chem.SanitizeFlags.SANITIZE_ADJUSTHS isn't enough for my
sanitization needs, so for consistency's sake I should probably do full
sanitization until I figure out what's going on.
Not that it will help in this case, but it might make me feel better.
Cheers,
Andrew
[email protected]
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss