On Nov 9, 2017, at 16:09, Brian Cole <col...@gmail.com> wrote:
> Here's an example of why this is useful at maintaining molecular 
> fragmentation inside your molecular representation:
>  >>> from rdkit import Chem
>  >>> smiles = 'F9.[C@]91(C)CCO1'
>  >>> fluorine, core = smiles.split('.')
>  >>> fluorine
>  'F9'
>  >>> fragment = core.replace('9', '([*:9])')

Somehow you got the code to generate a "9" for that ring closure, which is not 
something that RDKit does naturally, so we are only seeing one step of your 
larger goal.

The step you gave does a number of transformations to convert:


so the 4th atom has an '8' as an attachment point, that is:


Since you are already comfortable manipulating the SMILES string directly, a 
faster solution is to bypass the toolkit and edit the string itself, as in:

import re

# Match the SMILES term for one atom (ring-closure digits are not included)
atom_pattern = re.compile(r"""
 Cl? |              # C or Cl, both in the organic subset
 Br? |              # B or Br
 [NOSPFIbcnosp*] |  # the remaining single-letter organic-subset atoms
 \[[^]]*\]          # everything else must be in []s
""", re.X)

smiles = 'F9.[C@]91(C)CCO1'
fluorine, core = smiles.split('.')
matches = list(atom_pattern.finditer(core))
m = matches[3]
new_core = core[:m.end()] + "8" + core[m.end():]
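
The same steps can be wrapped in a small function so the atom position and 
closure label are not hard-coded. This is only a sketch: add_closure is a 
hypothetical helper name, and it assumes the closure is a single digit and that 
the SMILES uses only the atom syntax the pattern covers.

```python
import re

# Same atom pattern as above, written without re.X
ATOM_PATTERN = re.compile(r"Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]")

def add_closure(smiles, atom_term_index, closure):
    """Insert a ring-closure label right after the given atom term.

    atom_term_index counts atom terms in the SMILES text (0-based),
    not toolkit atom indices.
    """
    matches = list(ATOM_PATTERN.finditer(smiles))
    if not 0 <= atom_term_index < len(matches):
        raise IndexError("no atom term at index %d" % atom_term_index)
    end = matches[atom_term_index].end()
    return smiles[:end] + closure + smiles[end:]

core = "[C@]91(C)CCO1"
new_core = add_closure(core, 3, "8")
print(new_core)  # [C@]91(C)CC8O1
```

Placing the new label directly after the atom term, ahead of any closures the 
atom already has, keeps it in front of any branches.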

Also, this:

  >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)

is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee that 
the nth atom term in the input SMILES becomes the nth atom in the molecule. It's 
close, but, for example, explicit '[H]' atoms are usually turned into implicit 
hydrogen counts.
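
A quick text-level check can at least flag one input where the term position 
and the toolkit index may diverge. The sketch below lists the atom terms and 
reports which ones are explicit '[H]'. The helper names are hypothetical, and it 
does not predict what indices RDKit will actually assign; it only points at the 
one cause mentioned above.

```python
import re

# Atom-term pattern from the example above, written without re.X
ATOM_PATTERN = re.compile(r"Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]")

def atom_terms(smiles):
    """Return the atom terms of a SMILES string in text order."""
    return [m.group() for m in ATOM_PATTERN.finditer(smiles)]

def explicit_hydrogen_terms(smiles):
    """Return the 0-based term positions that are written as '[H]'."""
    return [i for i, term in enumerate(atom_terms(smiles)) if term == "[H]"]

print(atom_terms("[H]C(F)(Cl)Br"))               # ['[H]', 'C', 'F', 'Cl', 'Br']
print(explicit_hydrogen_terms("[H]C(F)(Cl)Br"))  # [0]
```

If the second list is non-empty, counting atom terms in the string is not a 
safe way to choose an index for AddBond.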

Finally, there's another assumption in:
  >>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')
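
Those replace() calls only match a dummy atom that is wrapped directly in 
parentheses. A minimal sketch of a check for that assumption follows; 
unparenthesized_dummies is a hypothetical helper name, and it only inspects the 
text, it does not repair the SMILES.

```python
import re

# A mapped dummy atom term such as [*:9]
DUMMY_PATTERN = re.compile(r"\[\*:(\d+)\]")

def unparenthesized_dummies(smiles):
    """Return map numbers of dummy atoms not written exactly as '([*:n])'."""
    found = []
    for m in DUMMY_PATTERN.finditer(smiles):
        # Slicing (not indexing) gives '' at either end of the string
        before = smiles[m.start() - 1:m.start()]
        after = smiles[m.end():m.end() + 1]
        if not (before == "(" and after == ")"):
            found.append(int(m.group(1)))
    return found

print(unparenthesized_dummies("[C@]([*:9])1(C)CC([*:8])O1"))  # []
print(unparenthesized_dummies("[*:9][C@]1(C)CCO1"))           # [9]
```

Any map number returned here marks a dummy atom that the plain replace() calls 
would leave untouched.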

Sometimes the result will not be inside of ()s. For example, the same 
transformation on:


produces a new_core of:


when you want it to produce:


For what it's worth, the re-based version generates:


On Nov 9, 2017, at 16:27, Chris Earnshaw <cgearns...@gmail.com> wrote:
> Trouble is, you're mixing chemical operations and lexical ones.


> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.

This is exactly what mmpdb does, although in Python code. If anyone is 
interested, see 
https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .


