On Nov 9, 2017, at 16:09, Brian Cole <col...@gmail.com> wrote:
> Here's an example of why this is useful at maintaining molecular
> fragmentation inside your molecular representation:
>
> >>> from rdkit import Chem
> >>> smiles = 'F9.[C@]91(C)CCO1'
> >>> fluorine, core = smiles.split('.')
> >>> fluorine
> 'F9'
> >>> fragment = core.replace('9', '([*:9])')

Somehow you got the code to generate a "9" for that ring closure,
which is not something that RDKit does naturally, so we are only
seeing one step of your larger goal.

The step you gave does a number of transformations to convert:

  [C@]91(C)CCO1

so the 4th atom has an '8' as an attachment point, that is:

  [C@]91(C)CC8O1

Since you are already comfortable manipulating the SMILES string
directly, a faster solution is to bypass the toolkit entirely and
edit the string, as in:

########
import re

# Match the SMILES for a single atom term.
# (Ring-closure digits after the atom are not part of the match.)
atom_pattern = re.compile(r"""
(
  Cl? |              # Cl and Br are part of the organic subset
  Br? |
  [NOSPFIbcnosp*] |  # as are these single-letter elements
  \[[^]]*\]          # everything else must be in []s
)
""", re.X)

smiles = 'F9.[C@]91(C)CCO1'
fluorine, core = smiles.split('.')

# Find every atom term, then pick the 4th one (index 3).
matches = list(atom_pattern.finditer(core))
m = matches[3]

# Insert the new closure digit directly after that atom term.
new_core = core[:m.end()] + "8" + core[m.end():]
print(new_core)
########

Also, this:

  >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)

is a piece of magic. Where does the 4 come from? RDKit doesn't
guarantee that the nth atom term in the input SMILES ends up as the
atom with the nth identifier. It's close, but, for example, explicit
'[H]' atoms are usually turned into implicit hydrogen counts.

Finally, there's another assumption in:

  >>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')

Sometimes the dummy atom will not be inside of ()s. For example, the
same transformation on:

  F9.[C@]91(C)C(C)O1

produces a new_core of:

  C[C@@]19OC1C[*:8]

when you want it to produce:

  C[C@@]19OC1C8

For what it's worth, the re-based version generates:

  [C@]91(C)C(C8)O1


On Nov 9, 2017, at 16:27, Chris Earnshaw <cgearns...@gmail.com> wrote:
> Trouble is, you're mixing chemical operations and lexical ones.

Agreed.

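As an illustration of the ()s problem above, here is a minimal sketch of a replacement step that accepts the dummy atom both with and without the enclosing parentheses. The names dummy_pattern and dummies_to_closures are made up for this example, and it is only a lexical rewrite under simple assumptions: it does not handle a dummy atom at the start of the SMILES, a dummy atom that follows another branch on the same atom, or any stereochemistry adjustment.

```python
import re

# Hypothetical helper, not part of RDKit or mmpdb.
# Matches a labelled dummy atom such as [*:8], either wrapped in
# branch parentheses "([*:8])" or bare "[*:8]".
dummy_pattern = re.compile(r"\(\[\*:(\d{1,2})\]\)|\[\*:(\d{1,2})\]")

def dummies_to_closures(smiles):
    """Replace each [*:n] or ([*:n]) with the closure label n."""
    def repl(m):
        # Exactly one of the two groups is set, depending on which
        # alternative matched.
        label = m.group(1) or m.group(2)
        # Two-digit closure numbers need the %nn form in SMILES.
        return label if len(label) == 1 else "%" + label
    return dummy_pattern.sub(repl, smiles)

# The unparenthesized case from the example above.
print(dummies_to_closures("C[C@@]19OC1C[*:8]"))
```

For the input shown, the bare '[*:8]' is rewritten to '8', which is the form wanted above. Anything beyond that simple case needs the kind of extra lexical bookkeeping discussed in the rest of this thread.
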
> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.

This is exactly what mmpdb does, although in Python code. If anyone
is interested, see
  https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .

Cheers,

                                Andrew
                                da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss