I think that here it's worth, at least initially, ignoring what is
currently possible with the RDKit (and how that's implemented) and instead
thinking about what we want to be able to do.[1]
The goal, I think, is to have some options allowing control over how a
query coming from a MOL block/CTAB actually matches target molecules. One
possible model for this would be to look at the options that were available
for searching in systems like ISIS/Host and ISIS/Base (and whatever it is
that they are now called). I no longer have access to those, but I would
guess that someone in the community does, or that some googling will turn up
documentation describing/showing the options. I remember there being
options like: "search as drawn", "allow/disallow substitution at
heteroatoms", "allow substitution everywhere", etc. This may be a good
starting point; then we can think about what kind of options we want to add
for interpreting "R" groups or Hs that have been explicitly added to the
drawing.
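To make the idea a bit more concrete, here is a minimal sketch of how such
options might be grouped. It is purely illustrative: `Substitution`,
`QueryMatchOptions`, and every field and method name are hypothetical names
made up for this message, not an existing RDKit or ISIS API, and the set of
options is only the handful recalled above.

```python
from dataclasses import dataclass
from enum import Enum


class Substitution(Enum):
    """Hypothetical policy: where may targets carry substituents not drawn?"""
    AS_DRAWN = "search as drawn"                 # nowhere
    NOT_AT_HETEROATOMS = "disallow at heteroatoms"  # only on carbon
    EVERYWHERE = "allow substitution everywhere"


@dataclass(frozen=True)
class QueryMatchOptions:
    """Hypothetical bundle of options for interpreting a CTAB query."""
    substitution: Substitution = Substitution.EVERYWHERE
    # Open questions from the discussion, recorded here only as flags:
    r_groups_are_wildcards: bool = True      # treat drawn "R" as any atom
    explicit_hs_block_substitution: bool = True  # drawn Hs pin the valence

    def allows_extra_substituent(self, is_heteroatom: bool) -> bool:
        """May a query atom match a target atom with extra neighbors?"""
        if self.substitution is Substitution.AS_DRAWN:
            return False
        if self.substitution is Substitution.NOT_AT_HETEROATOMS:
            return not is_heteroatom
        return True


opts = QueryMatchOptions(substitution=Substitution.NOT_AT_HETEROATOMS)
print(opts.allows_extra_substituent(is_heteroatom=False))
print(opts.allows_extra_substituent(is_heteroatom=True))
```

How each option would actually be translated into query features on the
atoms is exactly the part that still needs to be worked out.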
Does the thought make sense to you guys? Does anyone have access
to/remember better what those search options are?
-greg
[1] all the while keeping somewhere in mind that the core of the RDKit is
really using a more "Daylight-like" model and that there is almost
certainly going to be some mismatch with the MDL model... but we'll worry
about that when we get there.
On Mon, Jun 6, 2016 at 7:04 PM, Brian Kelley <fustiga...@gmail.com> wrote:
> An interesting conversation came up at work a few days ago regarding
> MolBlocks/CTABs with queries that behave in an unexpected manner. I'm
> tackling some of these issues for the processing of reaction (.rxn)
> files and plan on contributing that work relatively soon. However, I hadn't
> considered making it a generic query-based sanitization/processing step.
>
>
> The basic question was "How do I get a MolBlock to only match the "R"'s
> and not allow substitutions anywhere else? like ChemAxon..."
>
>
> As it turns out, RDKit is very strict when it looks at RGroups. This was
> the initial issue when I started sanitizing RGroups. Basically there
> are several variants in the wild (ChemDraw/ICM) that make reactions that
> don't quite follow the CTAB spec. RDKit likes the atom labeled R to (1)
> actually be in an "M RGP" tag and (2) have an atom mapping. If an atom is
> labeled "R" and not in an R_GRP, it isn't considered a wildcard, for instance.
>
> Now queries don't really care about "M RGP", but they do care that it
> isn't a dummy atom. I'm listing below our current technique to fix these
> issues for CTAB queries and would like some feedback.
>
> Here is the workflow that we have been telling chemists during sketching:
>
> 1. Make a proper group. The MarvinSketch/ChemDraw "R" is not enough; you
> can replace it with "A", but R has special semantics and needs an RGroup
> label defined.
> 2. Aromatize where appropriate.
> 3. (Optionally) protonate so only RGroups can match.
>
> These line up with the following RDKit code snippets:
>
> 1. Fix the "R"s (note we probably should make proper RGroups, but this
> just adds dummy matches)
>
> from rdkit import Chem
> from rdkit.Chem import rdqueries
>
> qmol = Chem.RWMol(Chem.MolFromMolBlock(molblock))
> # first, change the "R"s (dummy atoms, atomic number 0) into any-atom
> # queries; collect the indices up front so we don't edit while iterating
> dummy_idxs = [atom.GetIdx() for atom in qmol.GetAtoms()
>               if atom.GetAtomicNum() == 0]
> for idx in dummy_idxs:
>     qmol.ReplaceAtom(idx, rdqueries.AtomNumGreaterQueryAtom(0))
>
>
> 2. Aromatize - this might be good or might break things. It seems to work
> great, even with conditional logic, e.g. [C,O], but I'm unsure which atom is
> actually being used to supply the pi electrons for aromaticity checking. I
> expect the first, actually. In any case, something needs to happen in
> general for random inputs, otherwise the matching doesn't really do what is
> expected.
>
> # We want to see if we can find aromaticity, this may be complicated with
> # query features [C,O] but it works ok.
> Chem.SanitizeMol(qmol, Chem.SANITIZE_SETAROMATICITY)
>
> 3. protonate if the desire is to only match RGroups
>
> # second, add explicit Hs so we only match the Rs
> # I'm unclear if this can fail in general, I would probably wrap this in
> # a try...except block
> Chem.SanitizeMol(qmol, Chem.SANITIZE_ADJUSTHS)
> qmol = Chem.MergeQueryHs(Chem.AddHs(qmol))
>
> This could be enabled with flags into a SanitizeQuery function, or perhaps
> a PrepareQuery function.
>
> Thoughts?
>
> Cheers,
> Brian
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss