On Nov 9, 2017, at 21:49, Brian Cole <col...@gmail.com> wrote:
> Certainly, but thousands of lines of Python doesn't fit in an email in an
> easily digestible way. :-)

I'll restate things since I wasn't clear. While this step may be what you
need for the way you structure things, there might be a better way to
structure things. I can't tell because I don't know what it is you are
trying to do.

> The reason I need to drop into a real RDKit molecule is because I want to be
> able to attach to any implicit hydrogen for my application. I couldn't think
> of an easy regular expression that located an atom block with one or more
> implicit hydrogens.

There isn't one. That requires at least a context-free grammar because it
needs to count the valence used by branches, and branches can be arbitrarily
nested.

I think your "any implicit hydrogen" will have problems when the implicit
hydrogen count is specified in square brackets, as with a chiral hydrogen,
or an atom outside of the organic subset, or one with another property
specified (e.g., isotopes or charge).

Leaving the tricky chiral hydrogen aside, you're turning:

  [C@]([*:9])1(C)C[SiH2]O1

where the silicon has an implicit hydrogen count of 2 and a valence of 4, into

  C[C@]19C[SiH2]8O1

where the silicon is now 5-valent. Similarly,

  [B-]1OCC[NH+]1

becomes

  [B-]1OCC[NH+]18

If you have some way to annotate which atoms have at least one implicit
hydrogen then you can use the regular expression from my last email, and if
it uses []s then reach in and reduce the H count by 1 as part of the
transformation.

You'll still need some special code to deal with chiral hydrogens.

BTW, I don't think you need closures for this at all. You have a set of
fragments, where you know which atom will be attached, and I believe you
control the ordering of the atoms in that fragment. If you use
MolToSmiles(rootedAtAtom) so that the attachment atom is always first, e.g.
place the "O" in the phenol first:

  Oc1ccccc1

then you can attach it to the core at a given point using a branch, e.g., to
attach it to the 4th atom of:

  [C@]1(C)CCO1 -> [C@]1(C)CC(Oc2ccccc2)O1

(The fragment's ring closure is renumbered from 1 to 2 here because ring
closure 1 is still open in the core at that point; left as "c1", it would
pair with the core's "[C@]1".)

This is the 4th regular expression match (to find the atom, and skip any ring
closures), followed by '(' followed by the rooted fragment followed by ')'
followed by the rest of the original string. Plus some modification of the
regular expression match itself to reduce the H count, if the implicit
H-count is stated explicitly.

This assumes the core attachment point has at least one implicit, non-chiral
hydrogen on it.

It feels slightly less tricky than the ring closure solution, though still
tricky.

                                Andrew
                                da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
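
The branch-based attachment described above can be sketched in Python roughly
as follows. This is only a minimal sketch under stated assumptions, not the
regular expression from the earlier email: the atom pattern ATOM and the
helper names ring_numbers, renumber_rings, reduce_h_count and attach_branch
are all hypothetical. It assumes the caller already knows that the chosen atom
has a replaceable hydrogen (nothing here counts valence for organic-subset
atoms), it refuses chiral bracket atoms, and it renumbers the fragment's ring
closures so they cannot pair with ring closures used in the core.

```python
import re

# Assumed atom pattern (hypothetical, not the regex from the earlier email):
# group 1 is the atom (bracket atom, Br/Cl, one-letter organic or aromatic
# atom, or '*'); group 2 is any ring closures that follow it, each with an
# optional bond symbol.
ATOM = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*)"
    r"((?:[-=#$:/\\]?(?:%\d\d|\d))*)"
)
RING = re.compile(r"%(\d\d)|(\d)")
# Simplified bracket-atom layout: isotope+symbol, chirality, H count, rest.
BRACKET = re.compile(r"\[(\d*[A-Za-z*][a-z]?)(@*)(?:H(\d*))?(.*)\]")


def ring_numbers(smiles):
    """Return the ring-closure numbers used outside bracket atoms."""
    used = set()
    for atom in ATOM.finditer(smiles):
        for ring in RING.finditer(atom.group(2)):
            used.add(int(ring.group(1) or ring.group(2)))
    return used


def renumber_rings(fragment, avoid):
    """Rewrite the fragment's ring closures using numbers not in `avoid`."""
    free = (n for n in range(1, 100) if n not in avoid)
    mapping = {}

    def fix_ring(ring):
        old = int(ring.group(1) or ring.group(2))
        if old not in mapping:
            mapping[old] = next(free)
        new = mapping[old]
        return str(new) if new < 10 else f"%{new:02d}"

    def fix_atom(atom):
        return atom.group(1) + RING.sub(fix_ring, atom.group(2))

    return ATOM.sub(fix_atom, fragment)


def reduce_h_count(atom):
    """Drop one H from a bracket atom; leave organic-subset atoms unchanged."""
    if not atom.startswith("["):
        return atom  # implicit H count is inferred from valence by the reader
    m = BRACKET.fullmatch(atom)
    if m is None:
        raise ValueError(f"cannot parse bracket atom {atom!r}")
    symbol, chiral, hcount, rest = m.groups()
    if chiral:
        raise NotImplementedError("chiral atoms need special handling")
    if hcount is None:
        raise ValueError(f"{atom!r} has no hydrogen to replace")
    n = int(hcount or "1") - 1
    h = "" if n == 0 else "H" if n == 1 else f"H{n}"
    return f"[{symbol}{h}{rest}]"


def attach_branch(core, atom_number, fragment):
    """Attach a rooted fragment as a branch on the atom_number-th atom (1-based)."""
    m = list(ATOM.finditer(core))[atom_number - 1]
    atom = reduce_h_count(m.group(1))
    fragment = renumber_rings(fragment, ring_numbers(core))
    # atom, then its ring closures, then the branch, then the rest of the core
    return core[:m.start()] + atom + m.group(2) + "(" + fragment + ")" + core[m.end():]


print(attach_branch("[C@]1(C)CCO1", 4, "Oc1ccccc1"))  # [C@]1(C)CC(Oc2ccccc2)O1
print(attach_branch("[B-]1OCC[NH+]1", 5, "C"))        # [B-]1OCC[N+]1(C)
```

Avoiding every ring-closure number that appears anywhere in the core is more
conservative than strictly necessary, since only closures still open at the
attachment point can clash, but it keeps the sketch short. The result should
still be passed through a real SMILES parser to check the valences.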