[Rdkit-discuss] confused about explicit hydrogens and canonicalization

Andrew Dalke Mon, 01 Feb 2016 23:07:03 -0800

Hi all,

  I have a problem that I think is due to my not understanding how to work with 
explicit (or perhaps implicit) hydrogens.


  In my project, I want to find the core of a molecule as well as its R-groups. 
I use a SMARTS pattern to find the bonds to cut, then want to store two 
versions of the core:

  1) replace the R-groups with hydrogens, and store the canonical SMILES for 
the hydrogenated core. I do this by adding explicit hydrogens to each of the 
atoms where a bond was cut,

  2) add "-[*]" extensions on each of the atoms where a bond was cut, and store 
its canonical SMILES

For example, if the structure is "OCCN" with a core of "CC" and the two 
R-groups "O" and "N", then the two versions might be (1) "CC" and (2) 
"[*]CC[*]".


To make sure I did things correctly, I used string substitution to replace the 
"[*]" terms from form (2) with "[H]", then parsed and regenerated the SMILES to 
create (2H). This should generate the identical output as (1).

Unfortunately, it doesn't. But if I take the output from (1) and 
reparse/regenerate the SMILES string, then while I expect the SMILES to be 
unchanged, it actually produces a new SMILES (1'), which is identical to (2H).


I've attached a reproducible. Here's the output:

Using RDKit version 2016.03.1.dev1
 == Increase explicit hydrogen count & make SMILES / reperceive & make SMILES  
==
   hydrogen smiles: c1ccc(nc1)-n1ncc2ccc(nc21)C1CC1
 reparsed h-smiles: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1

 == Add '*' atoms & make SMILES /  *->'H' / reperceive & make SMILES ==
       star smiles: [*]c1cc(nc2c1c([*])nn2-c1ccccn1)C1CC1
substituted smiles: [H]c1cc(nc2c1c([H])nn2-c1ccccn1)C1CC1
   absorbed smiles: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1

In this output, the "hydrogen smiles" is (1), which I generated by adding 
explicit hydrogens, and the "reparsed h-smiles" is (1'), which I generate by 
reparsing/canonicalizing (1).

While the "star smiles" is the is (2), where the "substituted smiles" is the 
text replacement of "*" with "H". Finally, I parse the substituted smiles and 
generate the final canonical SMILES, which I term the "absorbed smiles" because 
the SMILES parser by default places the '[H]' terms on the parent atom. This is 
(2H) in my above notation.

I expected the "hydrogen smiles" to be the same as the "reparsed h-smiles" and 
the "absorbed smiles", but it is not.

I can't figure out what I did wrong.

As a work-around, I could reperceive my (1) to get (1'), or use the *->H 
technique to get (2H). However, that seems inelegant, and parsing the SMILES 
string has a high overhead so this will substantially lower my performance.


                                Andrew
                                [email protected]

from __future__ import print_function

import rdkit
from rdkit import Chem

print("Using RDKit version", rdkit.__version__)

def get_longest_smiles_component(smiles):
        terms = smiles.split(".")
        return max(terms, key=len)

# Cut the structure between the two bonds, and replace the ends with hydrogens.
# (I only care about the core structure, which in this case is the longest component.)
mol = Chem.MolFromSmiles("Cc1c2c(cc(nc2n(n1)c3ccccn3)C4CC4)C(=O)NCCc5ccc(cc5)O")
cut_atoms = ((1, 0), (3, 19))

# I tried two ways to do this: 1) increase the hydrogen count, and 2) add an explicit atom
# Both start by removing the specified bonds
emol = Chem.EditableMol(mol)
for from_atom, to_atom in cut_atoms:
    emol.RemoveBond(from_atom, to_atom)

## Option #1: Add an explicit hydrogen to each of the atoms, and extract the core.

print(" == Increase explicit hydrogen count & make SMILES / reperceive & make SMILES  ==")
hydrogen_mol = emol.GetMol()
for cut_pair in cut_atoms:
        for atom in cut_pair:
                # Increase the explicit hydrogen count of each atom where
                # a bond was removed.
                atom_obj = hydrogen_mol.GetAtomWithIdx(atom)
                num_Hs = atom_obj.GetNumExplicitHs()
                atom_obj.SetNumExplicitHs(num_Hs+1)

# The core is the longest component of the SMILES:
complete_hydrogen_smiles = Chem.MolToSmiles(hydrogen_mol, isomericSmiles=True)
core_hydrogen_smiles = get_longest_smiles_component(complete_hydrogen_smiles)
print("   hydrogen smiles:", core_hydrogen_smiles)
# This gives: c1ccc(nc1)-n1ncc2ccc(nc21)C1CC1
# That SMILES appears to be incorrect!

# If I parse that SMILES and generate it again, I expect the same thing
reparsed_hydrogen_mol = Chem.MolFromSmiles(core_hydrogen_smiles)
reparsed_hydrogen_smiles = Chem.MolToSmiles(reparsed_hydrogen_mol, isomericSmiles=True)
print(" reparsed h-smiles:", reparsed_hydrogen_smiles)
# But I get the different SMILES: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1
# I expected this to match the previous SMILES!!!

## Option #2: add explicit '*' atoms, since in my real goal I track these positions.
# Use text substitution to convert '*' to 'H' atoms in the SMILES string,
# then reperceive and generate a new SMILES.

print("\n == Add '*' atoms & make SMILES /  *->'H' / reperceive & make SMILES ==")

## Add "*" atoms to each of the ends
for cut_pair in cut_atoms:
        for atom in cut_pair:
                new_atom = emol.AddAtom(Chem.Atom(0))
                emol.AddBond(atom, new_atom, Chem.BondType.SINGLE)

# Get the core SMILES with the 'star' extensions
star_mol = emol.GetMol()
complete_star_smiles = Chem.MolToSmiles(star_mol, isomericSmiles=True)
core_star_smiles = get_longest_smiles_component(complete_star_smiles)
print("       star smiles:", core_star_smiles)
# This gives: [*]c1cc(nc2c1c([*])nn2-c1ccccn1)C1CC1

# Replace the '*' with the 'H'
substituted_smiles = core_star_smiles.replace("*", "H")
print("substituted smiles:", substituted_smiles)
# This gives: [H]c1cc(nc2c1c([H])nn2-c1ccccn1)C1CC1

# Let the SMILES parser "absorb" the hydrogen to the parent atom
absorbed_mol = Chem.MolFromSmiles(substituted_smiles)
absorbed_smiles = Chem.MolToSmiles(absorbed_mol, isomericSmiles=True)
print("   absorbed smiles:", absorbed_smiles)
# This gives: c1ccc(-n2ncc3ccc(C4CC4)nc32)nc1

# You can see that the "absorbed_smiles" matches the
# "reparsed_hydrogen_smiles" which is what I expect.

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

[Rdkit-discuss] confused about explicit hydrogens and canonicalization

Reply via email to