On Mon, Dec 3, 2012 at 3:40 PM, Andrew Dalke <[email protected]> wrote:

> What are the steps one must do to use an input structure (from
> a SMILES string) as a substructure query? It looks like I need
> to remove explicit hydrogens [* see footnote]. Is there anything
> else? And what is the right way to remove explicit hydrogens?
>

There's unfortunately some ugliness here. I'll briefly explain the problem
below, along with how to work around it.


>
> I'm working again on a project to do substructure searches.
>
> In the data flow, the user sketches a structure, which gets
> turned into a SMILES string. Let's suppose that sketched
> SMILES is c1cc2ccccc2[nH]1 .
>
> The dead-simple, brute-force solution is:
>
>   from rdkit import Chem
>
>   query = Chem.MolFromSmiles("c1cc2ccccc2[nH]1")
>   for mol in dataset:  # dataset: any iterable of RDKit molecules
>       if mol.HasSubstructMatch(query):
>           print("Match!", mol.GetProp("_Name"))
>
> However, that doesn't work like I hoped it would. Consider these:
>
> >>> mol = Chem.MolFromSmiles("c1c(C)c2c(N)cccc2[nH]1")
> >>> mol.HasSubstructMatch(query)
> True
> >>> mol = Chem.MolFromSmiles("Fn1cc(Cl)c2ccccc21")
> >>> mol.HasSubstructMatch(query)
> False
>
> I expected the sketched structure to match both structures,
> and not just the first. The failure appears to be because
> of the explicit hydrogen in the '[nH]' term. If I remove the
> 'H' then the match is fine.
>
> >>> print(*(atom.GetNumExplicitHs() for atom in query.GetAtoms()))
> 0 0 0 0 0 0 0 0 1
> >>> for atom in query.GetAtoms():
> ...     atom.SetNumExplicitHs(0)
> ...
> >>> mol.HasSubstructMatch(query)
> True
>
> Is a description of how atoms and bonds are matched, when the
> given substructure comes from a molecule and not a SMARTS,
> available somewhere?
>

Yes, it's here:
http://www.rdkit.org/docs/RDKit_Book.html#atom-atom-matching-in-substructure-queries


> Finally, am I missing anything else in what I need to do
> in order to prepare an input substructure as a substructure
> query?
>

The problem is that the aromatic N in the query has, according to the
RDKit, an explicit H. So when the query executes, it uses that explicit H
as part of the matching criteria (see the link above). This is plainly
wrong, but the best fix to the problem is going to require some
re-imagining of how the RDKit handles hydrogen atoms. I've been wanting to
do this for a while, but it's potentially a code-breaking change, so I've
been avoiding it. I'll start a thread on the rdkit-devel list about this,
in case anyone wants to participate.

In the meantime, your workaround of setting the number of explicit Hs to
zero on atoms in the query molecule should solve the problem.
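
For anyone who wants that workaround in reusable form, here is a minimal sketch. The helper name prepare_query is made up for illustration and is not part of the RDKit API; it only wraps the MolFromSmiles and SetNumExplicitHs calls already used in this thread. It assumes the sketched SMILES parses cleanly and that discarding the explicit H counts is acceptable for your use case.

```python
from rdkit import Chem


def prepare_query(smiles):
    """Hypothetical helper: parse a sketched SMILES for use as a query.

    Applies the workaround from this thread by zeroing the explicit H
    count on every atom, so that count is not used as a match criterion.
    """
    query = Chem.MolFromSmiles(smiles)
    if query is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    for atom in query.GetAtoms():
        atom.SetNumExplicitHs(0)
    return query


# The sketched structure and the two target molecules from the question.
query = prepare_query("c1cc2ccccc2[nH]1")
targets = ["c1c(C)c2c(N)cccc2[nH]1", "Fn1cc(Cl)c2ccccc21"]
results = [Chem.MolFromSmiles(smi).HasSubstructMatch(query) for smi in targets]
print(results)
```

With the H counts cleared, both example targets from the question should report a match. Note that the query is then looser than what was sketched: it no longer requires an H on the ring nitrogen, so only use it where that is what you want.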


> [*]
>   RDKit uses a different terminology than Daylight/OEChem.
> In RDKit, "O" has implicit hydrogens, "[OH2]" has explicit
> hydrogens, and "[2H]O[2H]" has two hydrogen atoms. In Daylight,
> the first two are different ways of writing implicit hydrogens,
> and the third has two explicit hydrogen atoms.
>   This is not obvious from the Daylight documentation, and
> various toolkits do it in different ways.
>

Yeah, the RDKit nomenclature about explicit Hs doesn't make much sense;
this is part of what needs to be changed.

-greg
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
