Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

Andrew Dalke Thu, 09 Nov 2017 13:54:05 -0800

On Nov 9, 2017, at 21:49, Brian Cole <col...@gmail.com> wrote:
> Certainly, but thousands of lines of Python doesn't fit in an email in an 
> easily digestible way. :-)


I'll restate things since I wasn't clear. While this step may be what you need 
for the way you structure things, there might be a better way to structure 
things. I can't tell because I don't know what it is you are trying to do.


> The reason I need to drop into a real RDKit molecule is because I want to be 
> able to attach to any implicit hydrogen for my application. I couldn't think 
> of an easy regular expression that located an atom block with one or more 
> implicit hydrogens.

There isn't one. That requires at least a context-free grammar because it needs 
to count the valence used by branches, and branches can be arbitrarily nested.

I think your "any implicit hydrogen" will have problems when the implicit 
hydrogen count is specified in square brackets, as with a chiral hydrogen, or 
an atom outside of the organic subset, or one with another property specified 
(e.g., isotopes or charge).
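
To make those problem cases concrete, here is a minimal sketch that lists the bracket atoms in a SMILES string which state their hydrogen count explicitly. The two patterns and the function name are my own illustration, not anything from RDKit, and they cover only the common bracket-atom fields (isotope, symbol, @/@@ chirality, H count, charge, atom class).

```python
import re

# Hypothetical pattern: one SMILES atom, either a bracket atom or an
# organic-subset atom ('*' included so attachment points are seen too).
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*")

# Hypothetical pattern for the fields inside a bracket atom.
BRACKET = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|se|as|[a-z]|\*)"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]+\d*)?"
    r"(?::(?P<cls>\d+))?\]"
)

def explicit_h_atoms(smiles):
    """Return (token, h_count, is_chiral) for each bracket atom with an H count."""
    found = []
    for match in ATOM.finditer(smiles):
        token = match.group()
        fields = BRACKET.fullmatch(token)
        if fields is None or fields.group("hcount") is None:
            continue  # organic-subset atom, or bracket atom without H
        h = fields.group("hcount")
        count = int(h[1:]) if len(h) > 1 else 1  # "H" alone means one hydrogen
        found.append((token, count, fields.group("chiral") is not None))
    return found

print(explicit_h_atoms("[C@]([*:9])1(C)C[SiH2]O1"))
print(explicit_h_atoms("N[C@@H](C)C(=O)O"))
```

Any atom this reports needs its bracket text rewritten when a bond replaces one of its hydrogens, and any atom flagged as chiral needs extra care on top of that.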

Leaving the tricky chiral hydrogen aside, you're turning:

  [C@]([*:9])1(C)C[SiH2]O1

where the silicon has an implicit hydrogen count of 2 and a valence of 4, into 

  C[C@]19C[SiH2]8O1

where the silicon is now 5-valent. Similarly,

  [B-]1OCC[NH+]1

becomes

  [B-]1OCC[NH+]18

If you have some way to annotate which atoms have at least one implicit 
hydrogen then you can use the regular expression from my last email, and if 
the matched atom is written in []s then reach in and reduce its H count by 1 
as part of the transformation.

You'll still need some special code to deal with chiral hydrogens.
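
The "reach in and reduce the H count" step might look something like this. It is only a sketch: the pattern and the function name drop_one_h are hypothetical, it handles only @/@@ chirality tags, and it refuses chiral atoms outright rather than attempting the special handling they need.

```python
import re

# Hypothetical pattern: splits a bracket atom around its optional
# chirality tag and hydrogen count; everything else is kept verbatim.
H_FIELD = re.compile(
    r"(?P<head>\[\d*(?:[A-Z][a-z]?|se|as|[a-z]))"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<tail>[^\]]*\])"
)

def drop_one_h(atom):
    """Return the bracket atom with its stated H count reduced by one."""
    fields = H_FIELD.fullmatch(atom)
    if fields is None:
        raise ValueError("not a bracket atom: %r" % atom)
    h = fields.group("hcount")
    count = 0 if h is None else (int(h[1:]) if len(h) > 1 else 1)
    if count == 0:
        raise ValueError("no hydrogen to replace on %r" % atom)
    if fields.group("chiral"):
        # Removing a chiral hydrogen changes the neighbor order that the
        # chirality tag refers to, so it is deliberately not handled here.
        raise ValueError("chiral hydrogen needs special handling: %r" % atom)
    remaining = count - 1
    new_h = "" if remaining == 0 else "H" if remaining == 1 else "H%d" % remaining
    return fields.group("head") + new_h + fields.group("tail")

print(drop_one_h("[SiH2]"))
print(drop_one_h("[NH+]"))
```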

BTW, I don't think you need closures for this at all. You have a set of 
fragments, where you know which atom will be attached, and I believe you 
control the ordering of the atoms in that fragment.

If you use MolToSmiles() with its rootedAtAtom argument so that the attachment 
atom always comes first, e.g., to place the "O" of the phenol first:

    Oc1ccccc1

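then you can attach it to the core at a given point using a branch, e.g., to 
attach it to the 4th atom of:

  [C@]1(C)CCO1 -> [C@]1(C)CC(Oc2ccccc2)O1

(The fragment's ring closure is renumbered from 1 to 2 here because ring 
closure 1 is still open in the core at that point. Reusing 1 would bond the 
first aromatic carbon to the [C@] instead of closing the phenyl ring.)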

That is: take the 4th regular expression match (which finds the atom and skips 
any ring closures), then add '(' followed by the rooted fragment followed by 
')', followed by the rest of the original string. If the matched atom states 
its implicit H count explicitly, the match itself must also be modified to 
reduce that count by one. All of this assumes the core attachment point has at 
least one implicit, non-chiral hydrogen on it.
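
The string splice described above can be sketched as follows. The pattern is a stand-in I made up for the regular expression from the earlier email, which is not reproduced here, and attach_as_branch is a hypothetical name. The sketch does not adjust H counts, does not renumber ring closures, and does not touch chirality, so the caller has to supply a fragment whose ring-closure numbers do not collide with ones still open in the core.

```python
import re

# Hypothetical stand-in for the earlier regular expression:
# one atom followed by any ring closures attached to it.
ATOM_AND_CLOSURES = re.compile(
    r"(?:\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*)"  # the atom itself
    r"(?:[-=#$:/\\]?(?:%\d\d|\d))*"                  # its ring closures
)

def attach_as_branch(core, atom_number, rooted_fragment):
    """Insert rooted_fragment as a branch on the atom_number-th atom (1-based)."""
    matches = list(ATOM_AND_CLOSURES.finditer(core))
    if not 1 <= atom_number <= len(matches):
        raise IndexError("core has %d atoms" % len(matches))
    # Splice right after the atom and its ring closures, so the new
    # branch comes before any branches the atom already has.
    end = matches[atom_number - 1].end()
    return core[:end] + "(" + rooted_fragment + ")" + core[end:]

# The phenol fragment is rooted at its O and uses ring closure 2,
# because ring closure 1 is still open in this core.
print(attach_as_branch("[C@]1(C)CCO1", 4, "Oc2ccccc2"))
```

Placing the new branch first among the atom's branches keeps the splice simple, but it changes the neighbor order, which is one more reason chiral attachment points need separate handling.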

It feels slightly less tricky than the ring closure solution, though still 
tricky.


                                Andrew
                                da...@dalkescientific.com



