Hi Robert,

On Mon, Aug 10, 2020 at 8:09 PM Burbidge Robert (Hyper Recruitment
Solutions Limited) <robert.burbi...@ucb.com> wrote:

> Hi,
>
>
>
> Newbie here. I have a list of SMARTS strings and a list of SMILES strings.
> For each SMARTS string I would like to get the SMILES strings that are
> valid instantiations of the SMARTS string. I am using the Python API. I
> have gotten this far:
>
>
>
> from rdkit import Chem
>
> z3 = Chem.MolFromSmarts("[C:1]-[C@@H;D3;+0:2](-[C;D1;H3:3])-[C@@H;D3;+0:4](-[C:5])-[O;H1;D1;+0]")
>
> z2 = Chem.MolFromSmiles("[C@:1]([C@@:17]([H:18])([C:14]1([CH3:16])[CH3:15])[C@:12]1([H:13])[CH2:11][CH2:10]2)([C@@:8]23[CH3:9])([C@@H:6]([CH3:7])[C@H:5](O)[CH2:4]4)[C@:2]34[H:3]")
>
> if z3.HasSubstructMatch(z2):
>
>                # do something
>
>
You've got the order backwards here. You want to ask the molecule whether
or not it has a substruct match to the query. So you'd do:
z2.HasSubstructMatch(z3)
That returns cases where the entire query (z3) matches part or all of the
molecule (z2).
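A minimal sketch of that call order, with the molecule asked about the query.
The SMARTS and SMILES strings here are small made-up examples, not the ones
from this thread, and the variable names are only for illustration:

```python
from rdkit import Chem

# Hypothetical inputs chosen for illustration only.
query = Chem.MolFromSmarts("[C;D1;H3]-[C;D3]-[O;H1;D1;+0]")  # the SMARTS pattern
mol = Chem.MolFromSmiles("CC(O)CC")                           # the molecule

# Ask the molecule whether it contains the query, not the other way around.
has_match = mol.HasSubstructMatch(query)

# GetSubstructMatch returns the molecule atom indices matched by the query
# atoms, or an empty tuple when there is no match.
match = mol.GetSubstructMatch(query)
print(has_match, match)
```

Every atom of the query has to be matched for this to succeed, so a hit never
corresponds to only a fragment of the SMARTS.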

As for making the search efficient: this problem is embarrassingly
parallel and you have a lot of memory, so you could parse all the
molecules into a Python list and then split the individual queries across
all of your CPUs (GPUs won't help here) using one of the Python libraries
for parallel work (multiprocessing or ipyparallel would probably both
work).
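One way this could look with multiprocessing is sketched below. The function
names (match_chunk, find_matches), the chunk size and the worker count are
assumptions made for illustration. The sketch also parses the SMILES inside
each worker instead of building one big list of molecules up front, which is a
design choice to avoid pickling molecule objects between processes, not
something required by RDKit:

```python
from multiprocessing import Pool

from rdkit import Chem


def match_chunk(task):
    """Return (query_index, molecule_index) pairs for one chunk of SMILES."""
    offset, smiles_chunk, smarts_list = task
    # Parse the queries once per chunk.
    queries = [Chem.MolFromSmarts(s) for s in smarts_list]
    hits = []
    for mol_idx, smi in enumerate(smiles_chunk, start=offset):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip SMILES that fail to parse
            continue
        for q_idx, query in enumerate(queries):
            if query is not None and mol.HasSubstructMatch(query):
                hits.append((q_idx, mol_idx))
    return hits


def find_matches(smiles_list, smarts_list, n_workers=4, chunk_size=1000):
    """Split the SMILES into chunks and match each chunk in a worker process."""
    tasks = [
        (start, smiles_list[start:start + chunk_size], smarts_list)
        for start in range(0, len(smiles_list), chunk_size)
    ]
    with Pool(n_workers) as pool:
        results = pool.map(match_chunk, tasks)
    # Flatten the per-chunk hit lists into one list.
    return [hit for chunk in results for hit in chunk]
```

A call such as find_matches(smiles, smarts, n_workers=96) would go under an
if __name__ == "__main__": guard, which multiprocessing needs on platforms
that spawn new processes. With millions of SMARTS it would probably also be
necessary to split the query list, since every task here carries all of the
queries.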

-greg




>
>
> This however would include cases where the SMILES matched only a
> sub-structure of the SMARTS, whereas I am looking for complete matches. For
> example, trivially, if the SMARTS represented several disjoint molecules
> separated by ‘.’ or a reaction with reactants and products separated by
> ‘>>’ then I would still get a match, which I don’t want. As it happens, I
> know that neither of these cases occur in my current dataset, but they
> might do in others; and I am not a chemist, so I don’t know whether it’s
> possible for a proper substructure to match without matching the whole
> SMARTS. I can’t find anything in the RDKit documentation or elsewhere
> online about this, but I am probably not using the right terminology to
> search.
>
>
>
> Also, my two datasets both have about 18 million records in them and for
> the purposes of this question let’s assume they are not canonical, so
> efficiency is also an issue. I have 96 CPUs, 8 GPUs, and up to 376G RAM at
> my disposal.
>
>
>
> Thanks in advance,
>
> Robert
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>