Hi all,
I have a problem that I think is due to my not understanding how to work with
explicit (or perhaps implicit) hydrogens.
In my project, I want to find the core of a molecule as well as its R-groups.
I use a SMARTS pattern to find the bonds to cut, then want to store two
versions of the core:
1) replace the R-groups with hydrogens, and store the canonical SMILES for
the hydrogenated core. I do this by adding explicit hydrogens to each of the
atoms where a bond was cut,
2) add "-[*]" extensions on each of the atoms where a bond was cut, and store
its canonical SMILES
For example, if the structure is "OCCN" with a core of "CC" and the two
R-groups "O" and "N", then the two versions might be (1) "CC" and (2)
"[*]CC[*]".
To make sure I did things correctly, I used string substitution to replace the
"[*]" terms from form (2) with "[H]", then parsed and regenerated the SMILES to
create (2H). This should generate the identical output as (1).
Unfortunately, it doesn't. But if I take the output from (1) and
reparse/regenerate the SMILES string, then while I expect the SMILES to be
unchanged, it actually produces a new SMILES (1'), which is identical to (2H).
I've attached a reproducible. Here's the output:
Using RDKit version 2016.03.1.dev1
== Increase explicit hydrogen count & make SMILES / reperceive & make SMILES
==
hydrogen smiles: c1ccc(nc1)-n1ncc2ccc(nc21)C1CC1
reparsed h-smiles: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1
== Add '*' atoms & make SMILES / *->'H' / reperceive & make SMILES ==
star smiles: [*]c1cc(nc2c1c([*])nn2-c1ccccn1)C1CC1
substituted smiles: [H]c1cc(nc2c1c([H])nn2-c1ccccn1)C1CC1
absorbed smiles: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1
In this output, the "hydrogen smiles" is (1), which I generated by adding
explicit hydrogens, and the "reparsed h-smiles" is (1'), which I generate by
reparsing/canonicalizing (1).
While the "star smiles" is the is (2), where the "substituted smiles" is the
text replacement of "*" with "H". Finally, I parse the substituted smiles and
generate the final canonical SMILES, which I term the "absorbed smiles" because
the SMILES parser by default places the '[H]' terms on the parent atom. This is
(2H) in my above notation.
I expected the "hydrogen smiles" to be the same as the "reparsed h-smiles" and
the "absorbed smiles", but it is not.
I can't figure out what I did wrong.
As a work-around, I could reperceive my (1) to get (1'), or use the *->H
technique to get (2H). However, that seems inelegant, and parsing the SMILES
string has a high overhead so this will substantially lower my performance.
Andrew
[email protected]
from __future__ import print_function
import rdkit
from rdkit import Chem
print("Using RDKit version", rdkit.__version__)
def get_longest_smiles_component(smiles):
terms = smiles.split(".")
return max(terms, key=len)
# Cut the structure between the two bonds, and replace the ends with hydrogens.
# (I only care about the core structure, which in this case is the longest component.)
mol = Chem.MolFromSmiles("Cc1c2c(cc(nc2n(n1)c3ccccn3)C4CC4)C(=O)NCCc5ccc(cc5)O")
cut_atoms = ((1, 0), (3, 19))
# I tried two ways to do this: 1) increase the hydrogen count, and 2) add an explicit atom
# Both start by removing the specified bonds
emol = Chem.EditableMol(mol)
for from_atom, to_atom in cut_atoms:
emol.RemoveBond(from_atom, to_atom)
## Option #1: Add an explicit hydrogen to each of the atoms, and extract the core.
print(" == Increase explicit hydrogen count & make SMILES / reperceive & make SMILES ==")
hydrogen_mol = emol.GetMol()
for cut_pair in cut_atoms:
for atom in cut_pair:
# Increase the explicit hydrogen count of each atom where
# a bond was removed.
atom_obj = hydrogen_mol.GetAtomWithIdx(atom)
num_Hs = atom_obj.GetNumExplicitHs()
atom_obj.SetNumExplicitHs(num_Hs+1)
# The core is the longest component of the SMILES:
complete_hydrogen_smiles = Chem.MolToSmiles(hydrogen_mol, isomericSmiles=True)
core_hydrogen_smiles = get_longest_smiles_component(complete_hydrogen_smiles)
print(" hydrogen smiles:", core_hydrogen_smiles)
# This gives: c1ccc(nc1)-n1ncc2ccc(nc21)C1CC1
# That SMILES appears to be incorrect!
# If I parse that SMILES and generate it again, I expect the same thing
reparsed_hydrogen_mol = Chem.MolFromSmiles(core_hydrogen_smiles)
reparsed_hydrogen_smiles = Chem.MolToSmiles(reparsed_hydrogen_mol, isomericSmiles=True)
print(" reparsed h-smiles:", reparsed_hydrogen_smiles)
# But I get the different SMILES: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1
# I expected this to match the previous SMILES!!!
## Option #2: add explicit '*' atoms, since in my real goal I track these positions.
# Use text substitution to convert '*' to 'H' atoms in the SMILES string,
# then reperceive and generate a new SMILES.
print("\n == Add '*' atoms & make SMILES / *->'H' / reperceive & make SMILES ==")
## Add "*" atoms to each of the ends
for cut_pair in cut_atoms:
for atom in cut_pair:
new_atom = emol.AddAtom(Chem.Atom(0))
emol.AddBond(atom, new_atom, Chem.BondType.SINGLE)
# Get the core SMILES with the 'star' extensions
star_mol = emol.GetMol()
complete_star_smiles = Chem.MolToSmiles(star_mol, isomericSmiles=True)
core_star_smiles = get_longest_smiles_component(complete_star_smiles)
print(" star smiles:", core_star_smiles)
# This gives: [*]c1cc(nc2c1c([*])nn2-c1ccccn1)C1CC1
# Replace the '*' with the 'H'
substituted_smiles = core_star_smiles.replace("*", "H")
print("substituted smiles:", substituted_smiles)
# This gives: [H]c1cc(nc2c1c([H])nn2-c1ccccn1)C1CC1
# Let the SMILES parser "absorb" the hydrogen to the parent atom
absorbed_mol = Chem.MolFromSmiles(substituted_smiles)
absorbed_smiles = Chem.MolToSmiles(absorbed_mol, isomericSmiles=True)
print(" absorbed smiles:", absorbed_smiles)
# This gives: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1
# You can see that the "absorbed_smiles" matches the
# "reparsed_hydrogen_smiles" which is what I expect.
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss