An interesting conversation came up at work a few days ago regarding
MolBlocks/CTABs with queries that behave in an unexpected manner.  I'm
tackling some of these issues for reaction processing of .rxn-based
files and plan on contributing that work relatively soon.  However, I hadn't
considered making it a generic query-based sanitization/processing step.


The basic question was: "How do I get a MolBlock to only match the 'R's and
not allow substitutions anywhere else, like ChemAxon?"


As it turns out, RDKit is very strict when it looks at RGroups.  This was
the initial issue when I started sanitizing RGroups.  Basically, there
are several variants in the wild (ChemDraw/ICM) that produce reactions that
don't quite follow the CTAB spec.  RDKit likes the atom labeled R to (1)
actually be in an "M  RGP" tag and (2) have an atom mapping.  For instance,
an atom that is labeled "R" but is not in an R group ("M  RGP") entry isn't
considered a wild card.
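
One way to spot such atoms before handing a file to RDKit is a small
pre-check on the raw text.  The following is a minimal sketch, assuming a
V2000 CTAB with the standard fixed-width layout (three header lines, a counts
line, atom symbol in columns 32-34).  The function name unregistered_r_atoms
is made up for illustration and is not an RDKit API, and it only covers
condition (1), the "M  RGP" membership, not the atom mapping.

```python
import re

# Labels that look like R-group placeholders: "R", "R#", "R1", "R12", ...
# Real elements such as "Ru" or "Rh" do not match.
_R_LABEL = re.compile(r"^R(#|\d*)$")


def unregistered_r_atoms(molblock):
    """Return 1-based indices of R-labeled atoms missing from 'M  RGP' lines.

    Assumption: molblock is a V2000 CTAB with fixed-width atom lines.
    """
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])          # first field of the counts line
    atom_lines = lines[4:4 + n_atoms]

    # Collect the atom indices registered in "M  RGP" property lines.
    registered = set()
    for line in lines[4 + n_atoms:]:
        if line.startswith("M  RGP"):
            fields = line[6:].split()
            # fields[0] is the entry count; the rest are (atom, rgroup) pairs
            registered.update(int(a) for a in fields[1::2])

    missing = []
    for idx, line in enumerate(atom_lines, start=1):
        symbol = line[31:34].strip()
        if _R_LABEL.match(symbol) and idx not in registered:
            missing.append(idx)
    return missing
```

Any index it returns points at an atom that a sketcher labeled as an R group
without writing the matching "M  RGP" entry.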

Now queries don't really care about "M  RGP", but they do care that it
isn't a dummy atom.  I'm listing below our current technique to fix these
issues for CTAB queries and would like some feedback.

Here is the workflow that we have been telling chemists to follow when sketching:

1. Make a proper R group.  The MarvinSketch/ChemDraw "R" is not enough; you
can replace it with "A", but R has special semantics and needs an RGroup
label defined.
2. aromatize where appropriate
3. (optionally) protonate so only RGroups can match

These line up with the following RDKit code snippets:

1. Fix the "R"s (note we probably should make proper RGroups, but this just
adds dummy matches)

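from rdkit import Chem
from rdkit.Chem import rdqueries

qmol = Chem.MolFromMolBlock(molblock)
# first, change the "R"s (dummy atoms, atomic number 0) into
# query atoms that match any real atom
qmol = Chem.RWMol(qmol)
# collect the indices first so the molecule isn't modified while iterating
dummy_idxs = [atom.GetIdx() for atom in qmol.GetAtoms()
              if atom.GetAtomicNum() == 0]
for idx in dummy_idxs:
    qmol.ReplaceAtom(idx, rdqueries.AtomNumGreaterQueryAtom(0))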


2. aromatize - this might be good or might break things.  It seems to work
great, even with conditional logic, e.g. [C,O], but I'm unsure which atom is
actually being used to supply the pi electrons for aromaticity checking.  I
expect the first one, actually.  In any case, something needs to happen in
general for arbitrary inputs; otherwise the matching doesn't really do what
is expected.

# We want to see if we can find aromaticity, this may be complicated with
#  query features [C,O] but it works ok.
Chem.SanitizeMol(qmol, Chem.SANITIZE_SETAROMATICITY)

3. protonate if the desire is to only match RGroups

# second, add explicit Hs so we only match the Rs
# I'm unclear if this can fail in general, I would probably wrap this in
#  a try...except block
Chem.SanitizeMol(qmol, Chem.SANITIZE_ADJUSTHS)
qmol = Chem.MergeQueryHs(Chem.AddHs(qmol))

Each of these steps could be switched on with flags passed to a SanitizeQuery
function, or perhaps a PrepareQuery function.
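
To make the flag idea concrete, here is a rough sketch of how such an entry
point might be shaped.  Everything in it is hypothetical: QueryPrep,
prepare_query and the flag names are not part of RDKit, and the steps are
passed in as plain callables so the sketch stays independent of any
particular implementation of the three snippets above.

```python
from enum import IntFlag


class QueryPrep(IntFlag):
    # Hypothetical flag names, not part of RDKit
    NONE = 0
    FIX_R_ATOMS = 1
    AROMATIZE = 2
    ADD_HS = 4
    ALL = FIX_R_ATOMS | AROMATIZE | ADD_HS


# Steps always run in the order listed in this message.
_ORDER = (QueryPrep.FIX_R_ATOMS, QueryPrep.AROMATIZE, QueryPrep.ADD_HS)


def prepare_query(mol, steps, flags=QueryPrep.ALL):
    """Apply each enabled step to mol and return the result.

    steps maps a QueryPrep flag to a callable that takes a molecule
    and returns the (possibly new) molecule.
    """
    for flag in _ORDER:
        if flags & flag and flag in steps:
            mol = steps[flag](mol)
    return mol
```

A caller who only wants the "R" fix and the explicit Hs would pass
QueryPrep.FIX_R_ATOMS | QueryPrep.ADD_HS and leave aromaticity alone.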

Thoughts?

Cheers,
 Brian