Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

Brian Cole Thu, 09 Nov 2017 12:50:44 -0800

>
> Somehow you got the code to generate a "9" for that ring closure, which is
> not something that RDKit does naturally, so we are only seeing a step in
> the larger part of your goal.
>


Certainly, but thousands of lines of Python don't fit in an email in an
easily digestible way. :-)


> Since you are already comfortable manipulating the SMILES string directly,
> a faster solution is to bypass the toolkit and manipulate the SMILES
> directly, as in:
>
> ########
> import re
>
> # Match the SMILES for a single atom (ring-closure digits are not included)
> atom_pattern = re.compile(r"""
> (
>  Cl? |             # Cl and Br are part of the organic subset
>  Br? |
>  [NOSPFIbcnosp*] |  # as are these single-letter elements
>  \[[^]]*\]         # everything else must be in []s
> )
> """, re.X)
>
> smiles = 'F9.[C@]91(C)CCO1'
> fluorine, core = smiles.split('.')
> matches = list(atom_pattern.finditer(core))
> m = matches[3]
> new_core = core[:m.end()] + "8" + core[m.end():]
> print(new_core)
> ########
>

The reason I need to drop into a real RDKit molecule is that I want to be
able to attach to any implicit hydrogen for my application. I couldn't
think of an easy regular expression that locates an atom block with one or
more implicit hydrogens, so I drop into an RDKit molecule for that part to
figure out where the hydrogens are that I can replace with a functional
group.
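
To make the hydrogen scan concrete, here is a minimal sketch (not the
actual application code) of how RDKit can report which atoms carry at
least one hydrogen. The helper name atoms_with_hydrogens and the example
SMILES are made up for this email; GetTotalNumHs() counts the implicit
hydrogens plus any explicit H count written on the atom.

```python
from rdkit import Chem


def atoms_with_hydrogens(smiles):
    """Return (atom index, symbol, hydrogen count) for every atom with >= 1 H.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return [
        (atom.GetIdx(), atom.GetSymbol(), atom.GetTotalNumHs())
        for atom in mol.GetAtoms()
        if atom.GetTotalNumHs() > 0
    ]


# Example molecule chosen for this sketch: a fluorinated, methylated oxetane.
sites = atoms_with_hydrogens("C[C@]1(F)CCO1")
print(sites)  # [(0, 'C', 3), (3, 'C', 2), (4, 'C', 2)]
```

Each reported index is a candidate attachment point. The stereocenter, the
fluorine and the ring oxygen are skipped because they have no hydrogen to
replace.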


> Also, this:
>
>   >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
>
> is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee
> that the nth atom term in the input SMILES is the same as the nth
> identifier. It's close, but, for example, explicit '[H]' atoms are usually
> turned into implicit hydrogen counts.
>

Hence the reason I use this to actually parse the SMILES:

from rdkit import Chem

def MolFromSmilesWithHydrogen(smiles):
    # Keep explicit [H] atoms instead of folding them into hydrogen counts.
    params = Chem.rdmolfiles.SmilesParserParams()
    params.removeHs = False
    return Chem.MolFromSmiles(smiles, params)

Even so, in the actual application the atom indices do refer to an actual
RDKit molecule that has been scanned for implicit hydrogen locations. I was
just trying to keep it 'email simple'.
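
As a small illustration of why the parser setting matters, the sketch
below parses the same SMILES with and without hydrogen removal and
compares the atom lists. The example SMILES (methanol with three explicit
[H] atoms) is an assumption picked for this email, not something from the
real application.

```python
from rdkit import Chem

# Methanol written with three explicit hydrogen atoms on the carbon.
smiles = "[H]C([H])([H])O"

# Default parsing: explicit [H] atoms are removed and become H counts.
default_mol = Chem.MolFromSmiles(smiles)

# Parsing with removeHs = False keeps the [H] atoms as real atoms.
params = Chem.SmilesParserParams()
params.removeHs = False
kept_mol = Chem.MolFromSmiles(smiles, params)

default_symbols = [atom.GetSymbol() for atom in default_mol.GetAtoms()]
kept_symbols = [atom.GetSymbol() for atom in kept_mol.GetAtoms()]
print(default_symbols)  # ['C', 'O']
print(kept_symbols)     # ['H', 'C', 'H', 'H', 'O']
```

With the default settings the carbon ends up at index 0, while with
removeHs = False it stays at index 1, matching its position among the atom
terms in the SMILES string.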


> > I've written code in the past to do this kind of thing for virtual
> > library building, using dummy atoms to mark link positions in the
> > fragments, and using Perl code to transform between the dummy atoms
> > and bond-closure numbers to give text strings which could be assembled
> > to give valid dot-disconnected SMILES. This required additional
> > lexical transformations in order to maintain valid SMILES depending on
> > where the dummy atom was, and to make sure that stereochemistry worked
> > properly. If you want to do this kind of thing I don't think you can
> > expect to avoid these additional lexical operations.
>
> This is exactly what mmpdb does, although in Python code. If anyone is
> interested, see https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .
>

And I've totally stolen your idea and run with it over the past year or so.
:-)

I'm hoping I can talk about it and maybe even open-source it sometime. I'd
like to hook it up to mmpdb as well, if I can.

Cheers,
Brian

