Hi,
Newbie here. I have a list of SMARTS strings and a list of SMILES strings. For
each SMARTS string I would like to get the SMILES strings that are valid
instantiations of the SMARTS string. I am using the Python API. I have gotten
this far:

    from rdkit import Chem

    # SMARTS pattern
    z3 = Chem.MolFromSmarts("[C:1]-[C@@H;D3;+0:2](-[C;D1;H3:3])-[C@@H;D3;+0:4](-[C:5])-[O;H1;D1;+0]")
    # SMILES molecule
    z2 = Chem.MolFromSmiles("[C@:1]([C@@:17]([H:18])([C:14]1([CH3:16])[CH3:15])[C@:12]1([H:13])[CH2:11][CH2:10]2)([C@@:8]23[CH3:9])([C@@H:6]([CH3:7])[C@H:5](O)[CH2:4]4)[C@:2]34[H:3]")
    if z3.HasSubstructMatch(z2):
        pass  # do something

This however would include cases where the SMILES matched only a sub-structure
of the SMARTS, whereas I am looking for complete matches. For example,
trivially, if the SMARTS represented several disjoint molecules separated by
'.' or a reaction with reactants and products separated by '>>' then I would
still get a match, which I don't want. As it happens, I know that neither of
these cases occurs in my current dataset, but they might in others; and I am
not a chemist, so I don't know whether it's possible for a proper substructure
to match without matching the whole SMARTS. I can't find anything in the RDKit
documentation or elsewhere online about this, but I am probably not using the
right terminology to search.
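
To make "complete match" concrete, here is a minimal sketch of the check I have in mind. It assumes that the molecule is the target and the SMARTS is the query (mol.HasSubstructMatch(pattern)), and that a match counts as complete when the pattern and the molecule have the same number of atoms and bonds, so that nothing in the molecule is left unmatched. The helper name is_complete_match is my own invention, and implicit hydrogens are not counted as atoms here.

```python
from rdkit import Chem


def is_complete_match(mol, pattern):
    """Return True if `pattern` matches `mol` and covers all of its atoms and bonds.

    Assumption: "complete" means equal atom and bond counts plus a
    substructure match. A match maps each pattern atom to a distinct
    molecule atom, so equal counts leave nothing in the molecule unmatched.
    """
    if mol is None or pattern is None:
        return False  # parsing failed for one of the inputs
    if pattern.GetNumAtoms() != mol.GetNumAtoms():
        return False
    if pattern.GetNumBonds() != mol.GetNumBonds():
        return False  # e.g. a ring closure that the pattern does not describe
    return mol.HasSubstructMatch(pattern, useChirality=True)


pattern = Chem.MolFromSmarts("[C:1]-[O;H1]")
for smiles in ("CO", "CCO"):
    print(smiles, is_complete_match(Chem.MolFromSmiles(smiles), pattern))
```

Patterns containing '.' would need extra care, since I have not checked how disconnected fragments are handled by this test.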
Also, my two datasets each have about 18 million records, and for the
purposes of this question let's assume they are not canonical, so efficiency is
also an issue. I have 96 CPUs, 8 GPUs, and up to 376 GB of RAM at my disposal.
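
Comparing every SMARTS against every SMILES would be roughly 18 million times 18 million tests, so I assume some pruning is needed. Below is a rough sketch of one idea: if a complete match requires the pattern and the molecule to have the same atom and bond counts (my assumption, not something I found in the documentation), then molecules can be bucketed by those counts and each pattern only tested against its own bucket. The function names are hypothetical, and this version holds every parsed molecule in memory.

```python
from collections import defaultdict

from rdkit import Chem


def size_key(m):
    """Bucket key: (atom count, bond count); implicit hydrogens are not counted."""
    return (m.GetNumAtoms(), m.GetNumBonds())


def complete_matches_by_size(smarts_list, smiles_list):
    """Map each SMARTS string to the SMILES strings it matches completely."""
    # Parse each SMILES once and group the molecules by size.
    buckets = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:  # skip strings that fail to parse
            buckets[size_key(mol)].append((smi, mol))

    result = {}
    for sma in smarts_list:
        patt = Chem.MolFromSmarts(sma)
        if patt is None:
            continue
        # Only molecules of the same size can be matched completely.
        candidates = buckets.get(size_key(patt), [])
        result[sma] = [
            smi for smi, mol in candidates
            if mol.HasSubstructMatch(patt, useChirality=True)
        ]
    return result


hits = complete_matches_by_size(["[C:1]-[O;H1]", "CCC"], ["CO", "CCO", "CCC", "C1CC1"])
```

Since each SMARTS is handled independently, I imagine the outer loop could be split across processes, but I have not tried that.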
Thanks in advance,
Robert
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss