On Nov 19, 2020, at 17:48, Gustavo Seabra <[email protected]> wrote: Is it possible to search for *partial* substructure matches using RDKit?... For example, if the pattern is a naphthalene and the molecule to A number of people pointed out that RDKit's MCS feature might be appropriate. I've attached an example program based around that. For example, the default is your two structures: % python mcs_search.py No --query specified, using naphthalene as the default. No --target or --targets specified, using phenol as the default. Target_ID: phenol nAtoms: 7 nBonds: 7 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.600 bond_overlap: 0.545 atom_Tanimoto: 0.545 bond_Tanimoto: 0.500 I'll reverse it by specifying the SMILES on the command-line. % python mcs_search.py --query 'c1ccccc1O' --target 'c1ccc2ccccc2c1' Target_ID: query nAtoms: 10 nBonds: 11 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.857 bond_overlap: 0.857 atom_Tanimoto: 0.545 bond_Tanimoto: 0.500 The program includes options to configure the FindMCS() parameters. In addition, if chemfp 3.x is installed then some additional features are available, like the following example, which applies the MCS search to all records in ChEBI: % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID' Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200 CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200 CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500 CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320 ... On Nov 20, 2020, at 15:56, Gustavo Seabra <[email protected]> wrote: Andrew [email protected] |
# Example of how to use RDKit's MCS search to find the MCS similarity # between two compounds.
EPILOG = """
The default --query is naphthalene.
The default --target is phenol.
The default output format when '--targets' is not specified is 'flat'.
% python mcs_search.py
No --query specified, using naphthalene as the default.
No --target or --targets specified, using phenol as the default.
Target_ID: phenol
nAtoms: 7
nBonds: 7
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.600
bond_overlap: 0.545
atom_Tanimoto: 0.545
bond_Tanimoto: 0.500
The first two lines of the above, both starting "No --", are written
to stderr. The other lines were written to --output/-o, which default
to stdout.
The output lines are:
nAtoms -- number of atoms in the target structure
nBonds -- number of bonds in the target structure
match_nAtoms -- number of atoms in the MCS match
match_nBonds -- number of bonds in the MCS match
atom_overlap -- number of atoms in the MCS match /
number of atoms in the query structure
bond_overlap -- number of bonds in the MCS match /
number of bonds in the query structure
atom_Tanimoto -- number of atoms in the MCS match /
(query num atoms + target num atoms - match num atoms)
bond_Tanimoto -- number of bonds in the MCS match /
(query num bonds + target num bonds - match num bond)
Use "--out tsv" to write the output as tab-separated fields, with a
header column.
By default, the --query and --target are interpreted as SMILES strings:
% python mcs_search.py --query CCCO --target NOC
Target_ID: query
nAtoms: 3
nBonds: 2
match_nAtoms: 2
match_nBonds: 1
atom_overlap: 0.500
bond_overlap: 0.333
atom_Tanimoto: 0.400
bond_Tanimoto: 0.250
Use --query-format and --target-format to specify an alternative format.
NOTE! This feature requires that chemfp 3.x or later be installed
% python mcs_search.py --query 'InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3' --query-format inchi
No --target or --targets specified, using phenol as the default.
Target_ID: phenol
nAtoms: 7
nBonds: 7
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.857
bond_overlap: 0.857
atom_Tanimoto: 0.750
bond_Tanimoto: 0.750
Use --targets to process all of the structures in a named file.
NOTE! This feature requires that chemfp 3.x or later be installed
% python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/chebi.smi
Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto
CHEBI:90 21 23 17 17 0.773 0.708 0.654 0.567
CHEBI:165 11 12 9 8 0.409 0.333 0.375 0.286
CHEBI:598 10 9 6 5 0.273 0.208 0.231 0.179
CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200
...
The command-line options support a few o the chemfp reader configurations.
For example, the following uses the 'ChEBI ID' tag of the records
in the gzip-ed SDF to get the record name.
% python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID'
Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto
CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200
CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200
CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500
CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320
...
Finally, nearly all of the MCS parameters can be configured on the command-line.
This program was written by Andrew Dalke <[email protected]> .
"""
import sys
import argparse
from rdkit import Chem
from rdkit.Chem import rdFMCS
def get_chemfp_rdkit_toolkit(option):
try:
import chemfp
except ImportError:
sys.stderr.write("chemfp must be installed to use the %s argument.\n" % (option,))
sys.stderr.write("Install it with:\n")
sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages\n")
raise SystemExit(10)
if not hasattr(chemfp, "__version_info__") or chemfp.__version_info__[0] < 3:
sys.stderr.write("A newer version of chemfp must be installed to use the %s argument." % (option,))
sys.stderr.write("Upgrade your version with:\n")
sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages --upgrade\n")
raise SystemExit(10)
from chemfp import rdkit_toolkit
return rdkit_toolkit
parser = argparse.ArgumentParser(
description="Compute MCS similarity scores between a query and target structure or structures",
epilog = EPILOG,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
# Query / target
parser.add_argument("--query", metavar="SMILES",
help="default query SMILES (default: naphthalene)")
target_group = parser.add_mutually_exclusive_group()
target_group.add_argument("--target", metavar="SMILES",
help="default target SMILES (default: phenol)")
target_group.add_argument("--targets", metavar="FILENAME",
help="read target structures from FILENAME; use '-' for stdin")
# MCS arguments
parser.add_argument("--maximize-atoms", action="store_true",
help="maximize the number of atoms (the default maximizes the number of atoms")
parser.add_argument("--match-valences", action="store_true",
help="atom matches must have the same valence")
parser.add_argument("--ring-matches-ring-only", action="store_true",
help="ring atoms must only match ring atoms")
parser.add_argument("--complete-rings-only", action="store_true",
help="only allow complete rings to match")
parser.add_argument("--match-chiral-tag", action="store_true",
help="atom matches must have the same chiral tag")
parser.add_argument("--atom-compare", choices=("Any", "AnyHeavyAtom", "Elements", "Isotopes"),
default="Elements",
help="method for comparing two atoms (default: Elements)")
parser.add_argument("--bond-compare", choices=("Any", "Order", "OrderExact"),
default="Order",
help="method for comparing two bonds (default: Order")
parser.add_argument("--ring-compare", choices=("IgnoreRingFusion", "PermissiveRingFusion", "StrictRingFusion"),
default="IgnoreRingFusion",
help="method for handling ring fusion (default: 'IgnoreRingFusion'")
_atom_comparisons = {
"Any": rdFMCS.AtomCompare.CompareAny,
"AnyHeavyAtom": rdFMCS.AtomCompare.CompareAnyHeavyAtom,
"Elements": rdFMCS.AtomCompare.CompareElements,
"Isotopes": rdFMCS.AtomCompare.CompareIsotopes,
}
_bond_comparisons = {
"Any": rdFMCS.BondCompare.CompareAny,
"Order": rdFMCS.BondCompare.CompareOrder,
"OrderExact": rdFMCS.BondCompare.CompareOrderExact,
}
_ring_comparisons = {
"IgnoreRingFusion": rdFMCS.RingCompare.IgnoreRingFusion,
"PermissiveRingFusion": rdFMCS.RingCompare.PermissiveRingFusion,
"StrictRingFusion": rdFMCS.RingCompare.StrictRingFusion,
}
# Configure how to parse the query and target(s)
parser.add_argument("--query-format", metavar="FORMAT",
help="query format (default is 'smi' for SMILES)")
parser.add_argument("--target-format", metavar="FORMAT",
help="target format (default: based on filename extension, defaults to SMILES")
parser.add_argument("--id-tag", metavar="STR",
help="tag containing the identifier, use if --target is an SDF")
parser.add_argument("--has-header", action="store_true",
help="if targets is a SMILES file, the first line contains the header")
parser.add_argument("--delimiter", choices=("to-eol", "tab", "whitespace", "space"),
help="if targets is a SMILES file, the delimiter style (default: 'to-eol'")
# Configure the output
parser.add_argument("-o", "--output", metavar="FILENAME",
help="write output to FILENAME")
parser.add_argument("--out", choices=("flat", "tsv"),
help="output style (default is 'flat' for --target and 'tsv' for --targets)")
def main():
args = parser.parse_args()
mcs_kwargs = {
"maximizeBonds": not args.maximize_atoms,
"matchValences": args.match_valences,
"ringMatchesRingOnly": args.ring_matches_ring_only,
"completeRingsOnly": args.complete_rings_only,
"matchChiralTag": args.match_chiral_tag,
"atomCompare": _atom_comparisons[args.atom_compare],
"bondCompare": _bond_comparisons[args.bond_compare],
"ringCompare": _ring_comparisons[args.ring_compare],
}
if args.query is None:
sys.stderr.write("No --query specified, using naphthalene as the default.\n")
query = Chem.MolFromSmiles("c1ccc2ccccc2c1") # naphthalene is the default query
else:
if args.query_format is None:
query = Chem.MolFromSmiles(args.query)
if query is None:
parser.error("Unable to parse --query as a SMILES string")
else:
# Use chemfp to process the --query
T = get_chemfp_rdkit_toolkit("--query")
try:
fmt = T.get_input_format(args.query_format)
except ValueError as err:
parser.error("Cannot use --query-format %s: %s" % (
args.query_format, err))
try:
query = T.parse_molecule(args.query, fmt)
except ValueError as err:
parser.error("Cannot parse --query as a %s record: %s" % (
args.query_format, err))
query_num_atoms = query.GetNumAtoms()
query_num_bonds = query.GetNumBonds()
headers = [
"Target_ID", "nAtoms", "nBonds", "match_nAtoms", "match_nBonds",
"atom_overlap", "bond_overlap", "atom_Tanimoto", "bond_Tanimoto"]
field_formats = [
"%s", "%d", "%d", "%d", "%d",
"%.3f", "%.3f", "%.3f", "%.3f"]
assert len(headers) == len(field_formats)
output_format = args.out
if args.targets is None:
if output_format is None:
output_format = "flat"
targets_close = None
# Use a single target
if args.target is None:
sys.stderr.write("No --target or --targets specified, using phenol as the default.\n")
target = Chem.MolFromSmiles("c1ccccc1O") # phenol is the default target
targets = [("phenol", target)]
elif args.target_format is None:
target = Chem.MolFromSmiles(args.target)
if target is None:
parser.error("Unable to parse --target as a SMILES string")
targets = [("query", target)]
else:
# Use chemfp to process the --target
T = get_chemfp_rdkit_toolkit("--target-format")
try:
fmt = T.get_input_format(args.target_format)
except ValueError as err:
parser.error("Cannot use --target-format %s: %s" % (
args.query_format, err))
try:
target = T.parse_molecule(args.target, fmt, id_tag=args.id_tag)
except ValueError as err:
parser.error("Cannot parse --target as a %s record: %s" % (
args.target_format, err))
targets = [("query", target)]
else:
if output_format is None:
output_format = "tsv"
# Read from the named file
# Use chemfp to process the --targets
T = get_chemfp_rdkit_toolkit("--targets")
try:
fmt = T.get_input_format(args.target_format)
except ValueError as err:
parser.error("Cannot use --target-format %s: %s" % (
args.query_format, err))
if not fmt.supports_io:
parser.error("chemfp does not support reading from --target-format %s files" % (
args.target_format,))
source = args.targets
if source == "-":
source = None
try:
targets = T.read_ids_and_molecules(
source, format=args.target_format,
id_tag=args.id_tag,
reader_args={
"delimiter": args.delimiter,
"has_header": args.has_header,
},
errors="report")
except IOError as err:
parser.error("Cannot open --targets file: %s" % (err,))
targets_close = targets.close
outfile_close = None
try:
# open the --output, or use stdout
if args.output is None:
outfile = sys.stdout
else:
try:
outfile = open(args.output, "w")
except IOError as err:
parser.error("Cannot open --output file: %s" % (err,))
outfile_close = outfile.close
if output_format == "tsv":
outfile.write("\t".join(headers) + "\n")
output_format = "\t".join(field for field in field_formats) + "\n"
elif output_format == "flat":
output_format = "\n".join(header + ": " + field for (header, field) in zip(headers, field_formats)) + "\n"
else:
raise AssertionError(output_format)
for target_idx, (target_id, target) in enumerate(targets, 1):
if target_id is None:
target_id = "MissingID-" + str(target_idx)
# Actually do the MCS search
match = rdFMCS.FindMCS([query, target], **mcs_kwargs)
# Report the results
if match is None:
# no match
fields = (
target_id,
target.GetNumAtoms(), target.GetNumBonds(),
0, 0,
0.0, 0.0,
0.0, 0.0)
else:
# describe the match
target_num_atoms = target.GetNumAtoms()
target_num_bonds = target.GetNumBonds()
match_numAtoms = match.numAtoms
match_numBonds = match.numBonds
fields = (
target_id,
target_num_atoms, target_num_bonds,
match_numAtoms, match_numBonds,
match_numAtoms / query_num_atoms,
match_numBonds / query_num_bonds,
match_numAtoms / (query_num_atoms + target_num_atoms - match_numAtoms),
match_numBonds / (query_num_bonds + target_num_bonds - match_numBonds),
)
outfile.write(output_format % fields)
finally:
if outfile_close is not None:
outfile_close()
if targets_close is not None:
targets_close()
if __name__ == "__main__":
main()
_______________________________________________ Rdkit-discuss mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

