On Nov 19, 2020, at 17:48, Gustavo Seabra <gustavo.sea...@gmail.com> wrote: Is it possible to search for *partial* substructure matches using RDKit?... For example, if the pattern is a naphthalene and the molecule to A number of people pointed out that RDKit's MCS feature might be appropriate. I've attached an example program based around that. For example, the default is your two structures: % python mcs_search.py No --query specified, using naphthalene as the default. No --target or --targets specified, using phenol as the default. Target_ID: phenol nAtoms: 7 nBonds: 7 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.600 bond_overlap: 0.545 atom_Tanimoto: 0.545 bond_Tanimoto: 0.500 I'll reverse it by specifying the SMILES on the command-line. % python mcs_search.py --query 'c1ccccc1O' --target 'c1ccc2ccccc2c1' Target_ID: query nAtoms: 10 nBonds: 11 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.857 bond_overlap: 0.857 atom_Tanimoto: 0.545 bond_Tanimoto: 0.500 The program includes options to configure the FindMCS() parameters. In addition, if chemfp 3.x is installed then some additional features are available, like the following example, which applies the MCS search to all records in ChEBI: % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID' Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200 CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200 CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500 CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320 ... On Nov 20, 2020, at 15:56, Gustavo Seabra <gustavo.sea...@gmail.com> wrote: Andrew da...@dalkescientific.com |
# Example of how to use RDKit's MCS search to find the MCS similarity # between two compounds.
EPILOG = """ The default --query is naphthalene. The default --target is phenol. The default output format when '--targets' is not specified is 'flat'. % python mcs_search.py No --query specified, using naphthalene as the default. No --target or --targets specified, using phenol as the default. Target_ID: phenol nAtoms: 7 nBonds: 7 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.600 bond_overlap: 0.545 atom_Tanimoto: 0.545 bond_Tanimoto: 0.500 The first two lines of the above, both starting "No --", are written to stderr. The other lines were written to --output/-o, which default to stdout. The output lines are: nAtoms -- number of atoms in the target structure nBonds -- number of bonds in the target structure match_nAtoms -- number of atoms in the MCS match match_nBonds -- number of bonds in the MCS match atom_overlap -- number of atoms in the MCS match / number of atoms in the query structure bond_overlap -- number of bonds in the MCS match / number of bonds in the query structure atom_Tanimoto -- number of atoms in the MCS match / (query num atoms + target num atoms - match num atoms) bond_Tanimoto -- number of bonds in the MCS match / (query num bonds + target num bonds - match num bond) Use "--out tsv" to write the output as tab-separated fields, with a header column. By default, the --query and --target are interpreted as SMILES strings: % python mcs_search.py --query CCCO --target NOC Target_ID: query nAtoms: 3 nBonds: 2 match_nAtoms: 2 match_nBonds: 1 atom_overlap: 0.500 bond_overlap: 0.333 atom_Tanimoto: 0.400 bond_Tanimoto: 0.250 Use --query-format and --target-format to specify an alternative format. NOTE! This feature requires that chemfp 3.x or later be installed % python mcs_search.py --query 'InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3' --query-format inchi No --target or --targets specified, using phenol as the default. Target_ID: phenol nAtoms: 7 nBonds: 7 match_nAtoms: 6 match_nBonds: 6 atom_overlap: 0.857 bond_overlap: 0.857 atom_Tanimoto: 0.750 bond_Tanimoto: 0.750 Use --targets to process all of the structures in a named file. NOTE! This feature requires that chemfp 3.x or later be installed % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/chebi.smi Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto CHEBI:90 21 23 17 17 0.773 0.708 0.654 0.567 CHEBI:165 11 12 9 8 0.409 0.333 0.375 0.286 CHEBI:598 10 9 6 5 0.273 0.208 0.231 0.179 CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200 ... The command-line options support a few o the chemfp reader configurations. For example, the following uses the 'ChEBI ID' tag of the records in the gzip-ed SDF to get the record name. % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID' Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200 CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200 CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500 CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320 ... Finally, nearly all of the MCS parameters can be configured on the command-line. This program was written by Andrew Dalke <da...@dalkescientific.com> . """ import sys import argparse from rdkit import Chem from rdkit.Chem import rdFMCS def get_chemfp_rdkit_toolkit(option): try: import chemfp except ImportError: sys.stderr.write("chemfp must be installed to use the %s argument.\n" % (option,)) sys.stderr.write("Install it with:\n") sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages\n") raise SystemExit(10) if not hasattr(chemfp, "__version_info__") or chemfp.__version_info__[0] < 3: sys.stderr.write("A newer version of chemfp must be installed to use the %s argument." % (option,)) sys.stderr.write("Upgrade your version with:\n") sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages --upgrade\n") raise SystemExit(10) from chemfp import rdkit_toolkit return rdkit_toolkit parser = argparse.ArgumentParser( description="Compute MCS similarity scores between a query and target structure or structures", epilog = EPILOG, formatter_class=argparse.RawDescriptionHelpFormatter, ) # Query / target parser.add_argument("--query", metavar="SMILES", help="default query SMILES (default: naphthalene)") target_group = parser.add_mutually_exclusive_group() target_group.add_argument("--target", metavar="SMILES", help="default target SMILES (default: phenol)") target_group.add_argument("--targets", metavar="FILENAME", help="read target structures from FILENAME; use '-' for stdin") # MCS arguments parser.add_argument("--maximize-atoms", action="store_true", help="maximize the number of atoms (the default maximizes the number of atoms") parser.add_argument("--match-valences", action="store_true", help="atom matches must have the same valence") parser.add_argument("--ring-matches-ring-only", action="store_true", help="ring atoms must only match ring atoms") parser.add_argument("--complete-rings-only", action="store_true", help="only allow complete rings to match") parser.add_argument("--match-chiral-tag", action="store_true", help="atom matches must have the same chiral tag") parser.add_argument("--atom-compare", choices=("Any", "AnyHeavyAtom", "Elements", "Isotopes"), default="Elements", help="method for comparing two atoms (default: Elements)") parser.add_argument("--bond-compare", choices=("Any", "Order", "OrderExact"), default="Order", help="method for comparing two bonds (default: Order") parser.add_argument("--ring-compare", choices=("IgnoreRingFusion", "PermissiveRingFusion", "StrictRingFusion"), default="IgnoreRingFusion", help="method for handling ring fusion (default: 'IgnoreRingFusion'") _atom_comparisons = { "Any": rdFMCS.AtomCompare.CompareAny, "AnyHeavyAtom": rdFMCS.AtomCompare.CompareAnyHeavyAtom, "Elements": rdFMCS.AtomCompare.CompareElements, "Isotopes": rdFMCS.AtomCompare.CompareIsotopes, } _bond_comparisons = { "Any": rdFMCS.BondCompare.CompareAny, "Order": rdFMCS.BondCompare.CompareOrder, "OrderExact": rdFMCS.BondCompare.CompareOrderExact, } _ring_comparisons = { "IgnoreRingFusion": rdFMCS.RingCompare.IgnoreRingFusion, "PermissiveRingFusion": rdFMCS.RingCompare.PermissiveRingFusion, "StrictRingFusion": rdFMCS.RingCompare.StrictRingFusion, } # Configure how to parse the query and target(s) parser.add_argument("--query-format", metavar="FORMAT", help="query format (default is 'smi' for SMILES)") parser.add_argument("--target-format", metavar="FORMAT", help="target format (default: based on filename extension, defaults to SMILES") parser.add_argument("--id-tag", metavar="STR", help="tag containing the identifier, use if --target is an SDF") parser.add_argument("--has-header", action="store_true", help="if targets is a SMILES file, the first line contains the header") parser.add_argument("--delimiter", choices=("to-eol", "tab", "whitespace", "space"), help="if targets is a SMILES file, the delimiter style (default: 'to-eol'") # Configure the output parser.add_argument("-o", "--output", metavar="FILENAME", help="write output to FILENAME") parser.add_argument("--out", choices=("flat", "tsv"), help="output style (default is 'flat' for --target and 'tsv' for --targets)") def main(): args = parser.parse_args() mcs_kwargs = { "maximizeBonds": not args.maximize_atoms, "matchValences": args.match_valences, "ringMatchesRingOnly": args.ring_matches_ring_only, "completeRingsOnly": args.complete_rings_only, "matchChiralTag": args.match_chiral_tag, "atomCompare": _atom_comparisons[args.atom_compare], "bondCompare": _bond_comparisons[args.bond_compare], "ringCompare": _ring_comparisons[args.ring_compare], } if args.query is None: sys.stderr.write("No --query specified, using naphthalene as the default.\n") query = Chem.MolFromSmiles("c1ccc2ccccc2c1") # naphthalene is the default query else: if args.query_format is None: query = Chem.MolFromSmiles(args.query) if query is None: parser.error("Unable to parse --query as a SMILES string") else: # Use chemfp to process the --query T = get_chemfp_rdkit_toolkit("--query") try: fmt = T.get_input_format(args.query_format) except ValueError as err: parser.error("Cannot use --query-format %s: %s" % ( args.query_format, err)) try: query = T.parse_molecule(args.query, fmt) except ValueError as err: parser.error("Cannot parse --query as a %s record: %s" % ( args.query_format, err)) query_num_atoms = query.GetNumAtoms() query_num_bonds = query.GetNumBonds() headers = [ "Target_ID", "nAtoms", "nBonds", "match_nAtoms", "match_nBonds", "atom_overlap", "bond_overlap", "atom_Tanimoto", "bond_Tanimoto"] field_formats = [ "%s", "%d", "%d", "%d", "%d", "%.3f", "%.3f", "%.3f", "%.3f"] assert len(headers) == len(field_formats) output_format = args.out if args.targets is None: if output_format is None: output_format = "flat" targets_close = None # Use a single target if args.target is None: sys.stderr.write("No --target or --targets specified, using phenol as the default.\n") target = Chem.MolFromSmiles("c1ccccc1O") # phenol is the default target targets = [("phenol", target)] elif args.target_format is None: target = Chem.MolFromSmiles(args.target) if target is None: parser.error("Unable to parse --target as a SMILES string") targets = [("query", target)] else: # Use chemfp to process the --target T = get_chemfp_rdkit_toolkit("--target-format") try: fmt = T.get_input_format(args.target_format) except ValueError as err: parser.error("Cannot use --target-format %s: %s" % ( args.query_format, err)) try: target = T.parse_molecule(args.target, fmt, id_tag=args.id_tag) except ValueError as err: parser.error("Cannot parse --target as a %s record: %s" % ( args.target_format, err)) targets = [("query", target)] else: if output_format is None: output_format = "tsv" # Read from the named file # Use chemfp to process the --targets T = get_chemfp_rdkit_toolkit("--targets") try: fmt = T.get_input_format(args.target_format) except ValueError as err: parser.error("Cannot use --target-format %s: %s" % ( args.query_format, err)) if not fmt.supports_io: parser.error("chemfp does not support reading from --target-format %s files" % ( args.target_format,)) source = args.targets if source == "-": source = None try: targets = T.read_ids_and_molecules( source, format=args.target_format, id_tag=args.id_tag, reader_args={ "delimiter": args.delimiter, "has_header": args.has_header, }, errors="report") except IOError as err: parser.error("Cannot open --targets file: %s" % (err,)) targets_close = targets.close outfile_close = None try: # open the --output, or use stdout if args.output is None: outfile = sys.stdout else: try: outfile = open(args.output, "w") except IOError as err: parser.error("Cannot open --output file: %s" % (err,)) outfile_close = outfile.close if output_format == "tsv": outfile.write("\t".join(headers) + "\n") output_format = "\t".join(field for field in field_formats) + "\n" elif output_format == "flat": output_format = "\n".join(header + ": " + field for (header, field) in zip(headers, field_formats)) + "\n" else: raise AssertionError(output_format) for target_idx, (target_id, target) in enumerate(targets, 1): if target_id is None: target_id = "MissingID-" + str(target_idx) # Actually do the MCS search match = rdFMCS.FindMCS([query, target], **mcs_kwargs) # Report the results if match is None: # no match fields = ( target_id, target.GetNumAtoms(), target.GetNumBonds(), 0, 0, 0.0, 0.0, 0.0, 0.0) else: # describe the match target_num_atoms = target.GetNumAtoms() target_num_bonds = target.GetNumBonds() match_numAtoms = match.numAtoms match_numBonds = match.numBonds fields = ( target_id, target_num_atoms, target_num_bonds, match_numAtoms, match_numBonds, match_numAtoms / query_num_atoms, match_numBonds / query_num_bonds, match_numAtoms / (query_num_atoms + target_num_atoms - match_numAtoms), match_numBonds / (query_num_bonds + target_num_bonds - match_numBonds), ) outfile.write(output_format % fields) finally: if outfile_close is not None: outfile_close() if targets_close is not None: targets_close() if __name__ == "__main__": main()
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss