Hi Greg,

Thanks for the advice. This appears to be what I want.

Rgds,
Robert

From: Greg Landrum <greg.land...@gmail.com>
Sent: 11 August 2020 07:52
To: Burbidge Robert (Hyper Recruitment Solutions Limited) 
<robert.burbi...@ucb.com>
Cc: rdkit-discuss@lists.sourceforge.net
Subject: [External] Re: [Rdkit-discuss] Matching SMILES to SMARTS

Hi Robert,


On Mon, Aug 10, 2020 at 8:09 PM Burbidge Robert (Hyper Recruitment Solutions 
Limited) <robert.burbi...@ucb.com> wrote:
Hi,

Newbie here. I have a list of SMARTS strings and a list of SMILES strings. For 
each SMARTS string I would like to get the SMILES strings that are valid 
instantiations of the SMARTS string. I am using the Python API. I have gotten 
this far:

from rdkit import Chem
z3 = Chem.MolFromSmarts("[C:1]-[C@@H;D3;+0:2](-[C;D1;H3:3])-[C@@H;D3;+0:4](-[C:5])-[O;H1;D1;+0]")
z2 = Chem.MolFromSmiles("[C@:1]([C@@:17]([H:18])([C:14]1([CH3:16])[CH3:15])[C@:12]1([H:13])[CH2:11][CH2:10]2)([C@@:8]23[CH3:9])([C@@H:6]([CH3:7])[C@H:5](O)[CH2:4]4)[C@:2]34[H:3]")
if z3.HasSubstructMatch(z2):
    pass  # do something

You've got the order backwards here. You want to ask the molecule whether or 
not it has a substruct match to the query. So you'd do: z2.HasSubstructMatch(z3)
That returns cases where the entire query (z3) matches part or all of the 
molecule (z2).
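
A minimal sketch of that call order, using small stand-in structures rather 
than the ones from the original message. The variable names (query, ethanol, 
benzene, two_part) are only for illustration: the molecule parsed from SMILES 
is the object the method is called on, and the pattern parsed from SMARTS is 
the argument.

```python
from rdkit import Chem

# Query pattern (from SMARTS): an sp3 carbon bonded to a hydroxyl oxygen.
query = Chem.MolFromSmarts("[CX4][OX2H]")

# Molecules (from SMILES) to be searched.
ethanol = Chem.MolFromSmiles("CCO")
benzene = Chem.MolFromSmiles("c1ccccc1")

# Ask the molecule whether it contains the whole query.
print(ethanol.HasSubstructMatch(query))  # the query is found inside ethanol
print(benzene.HasSubstructMatch(query))  # benzene has no hydroxyl, so no match

# GetSubstructMatch returns the molecule atom indices, in query-atom order.
print(ethanol.GetSubstructMatch(query))

# A query with two disconnected parts only matches if every part is found.
two_part = Chem.MolFromSmarts("C.N")
print(ethanol.HasSubstructMatch(two_part))  # no nitrogen in ethanol
```

Every atom of the query has to be mapped onto the molecule for a match to be 
reported, which is why the two-part pattern does not match ethanol here.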

As for making the search efficient: this is a problem which is embarrassingly 
parallel and you have a lot of memory, so you could parse all the molecules 
into a Python list and then split running the individual queries across all of 
your CPUs (GPUs won't help here) using one of the Python libraries for doing 
parallel work (multiprocessing or ipyparallel would probably both work).
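
One way this could be sketched with the standard-library multiprocessing 
module is shown here. It is an illustration under assumptions, not a tuned 
implementation: the helper names (init_worker, match_chunk, chunked, 
parallel_match) and the chunk size are made up for the example, and it varies 
the idea slightly by sending chunks of SMILES strings to the workers and 
parsing them there, so that no molecule objects have to be pickled between 
processes.

```python
from multiprocessing import Pool

from rdkit import Chem

# (query index, query mol) pairs; built once in each worker process.
_QUERIES = []


def init_worker(smarts_list):
    """Parse every SMARTS once per worker; unparsable patterns are skipped."""
    global _QUERIES
    _QUERIES = []
    for qi, sma in enumerate(smarts_list):
        q = Chem.MolFromSmarts(sma)
        if q is not None:
            _QUERIES.append((qi, q))


def match_chunk(task):
    """Return (query index, molecule index) pairs for one chunk of SMILES."""
    offset, smiles_chunk = task
    hits = []
    for k, smi in enumerate(smiles_chunk):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip SMILES that fail to parse
            continue
        for qi, q in _QUERIES:
            if mol.HasSubstructMatch(q):
                hits.append((qi, offset + k))
    return hits


def chunked(items, size):
    """Yield (start index, slice) pairs covering the whole list."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def parallel_match(smiles_list, smarts_list, processes=None, chunk_size=1000):
    """Run all queries against all molecules, one chunk of molecules per task."""
    with Pool(processes, initializer=init_worker,
              initargs=(smarts_list,)) as pool:
        results = pool.map(match_chunk, chunked(smiles_list, chunk_size))
    return sorted(hit for chunk_hits in results for hit in chunk_hits)
```

A script would call parallel_match(smiles_list, smarts_list, processes=96) 
from under an if __name__ == "__main__": guard. Each returned pair 
(query index, molecule index) refers back to positions in the two input lists.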

-greg




This however would include cases where the SMILES matched only a sub-structure 
of the SMARTS, whereas I am looking for complete matches. For example, 
trivially, if the SMARTS represented several disjoint molecules separated by 
‘.’ or a reaction with reactants and products separated by ‘>>’ then I would 
still get a match, which I don’t want. As it happens, I know that neither of 
these cases occurs in my current dataset, but they might do in others; and I am 
not a chemist, so I don’t know whether it’s possible for a proper substructure 
to match without matching the whole SMARTS. I can’t find anything in the RDKit 
documentation or elsewhere online about this, but I am probably not using the 
right terminology to search.

Also, my two datasets both have about 18 million records in them and for the 
purposes of this question let’s assume they are not canonical, so efficiency is 
also an issue. I have 96 CPUs, 8 GPUs, and up to 376G RAM at my disposal.

Thanks in advance,
Robert
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss