Re: [Rdkit-discuss] Partial substructure match?

Andrew Dalke Mon, 23 Nov 2020 05:09:52 -0800

On Nov 19, 2020, at 17:48, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

Is it possible to search for *partial* substructure matches using RDKit?

...

For example, if the pattern is a naphthalene and the molecule to
search has a benzene, that would count as a 60% match.

A number of people pointed out that RDKit's MCS feature might be appropriate.

I've attached an example program based around that.

For example, the default is your two structures:

% python mcs_search.py
No --query specified, using naphthalene as the default.
No --target or --targets specified, using phenol as the default.
Target_ID: phenol
nAtoms: 7
nBonds: 7
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.600
bond_overlap: 0.545
atom_Tanimoto: 0.545
bond_Tanimoto: 0.500

I'll reverse it by specifying the SMILES on the command-line.

% python mcs_search.py --query 'c1ccccc1O' --target 'c1ccc2ccccc2c1'
Target_ID: query
nAtoms: 10
nBonds: 11
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.857
bond_overlap: 0.857
atom_Tanimoto: 0.545
bond_Tanimoto: 0.500

The program includes options to configure the FindMCS() parameters.

In addition, if chemfp 3.x is installed then some additional features are available, like the following example, which applies the MCS search to all records in ChEBI:

% python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID'
Target_ID nAtoms nBonds match_nAtoms match_nBonds atom_overlap bond_overlap atom_Tanimoto bond_Tanimoto
CHEBI:776 21 24 9 8 0.409 0.333 0.265 0.200
CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200
CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500
CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320
...

On Nov 20, 2020, at 15:56, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

Is it possible to get a partial match with substructure search?

No.

Andrew
da...@dalkescientific.com

# Example of how to use RDKit's MCS search to find the MCS similarity
# between two compounds.


EPILOG = """
The default --query is naphthalene.
The default --target is phenol.
The default output format when '--targets' is not specified is 'flat'.

 % python mcs_search.py
 No --query specified, using naphthalene as the default.
 No --target or --targets specified, using phenol as the default.
 Target_ID: phenol
 nAtoms: 7
 nBonds: 7
 match_nAtoms: 6
 match_nBonds: 6
 atom_overlap: 0.600
 bond_overlap: 0.545
 atom_Tanimoto: 0.545
 bond_Tanimoto: 0.500

The first two lines of the above, both starting "No --", are written
to stderr. The other lines were written to --output/-o, which default
to stdout.

The output lines are:
 nAtoms -- number of atoms in the target structure
 nBonds -- number of bonds in the target structure
 match_nAtoms -- number of atoms in the MCS match
 match_nBonds -- number of bonds in the MCS match
 atom_overlap -- number of atoms in the MCS match /
                 number of atoms in the query structure
 bond_overlap -- number of bonds in the MCS match /
                 number of bonds in the query structure
 atom_Tanimoto -- number of atoms in the MCS match /
       (query num atoms + target num atoms - match num atoms)
 bond_Tanimoto  -- number of bonds in the MCS match /
       (query num bonds + target num bonds - match num bond)

Use "--out tsv" to write the output as tab-separated fields, with a
header column.

By default, the --query and --target are interpreted as SMILES strings:

 % python mcs_search.py --query CCCO --target NOC
 Target_ID: query
 nAtoms: 3
 nBonds: 2
 match_nAtoms: 2
 match_nBonds: 1
 atom_overlap: 0.500
 bond_overlap: 0.333
 atom_Tanimoto: 0.400
 bond_Tanimoto: 0.250

Use --query-format and --target-format to specify an alternative format.

 NOTE! This feature requires that chemfp 3.x or later be installed

 % python mcs_search.py --query 'InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3' --query-format inchi
 No --target or --targets specified, using phenol as the default.
 Target_ID: phenol
 nAtoms: 7
 nBonds: 7
 match_nAtoms: 6
 match_nBonds: 6
 atom_overlap: 0.857
 bond_overlap: 0.857
 atom_Tanimoto: 0.750
 bond_Tanimoto: 0.750

Use --targets to process all of the structures in a named file.

  NOTE! This feature requires that chemfp 3.x or later be installed

 % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/chebi.smi
 Target_ID	nAtoms	nBonds	match_nAtoms	match_nBonds	atom_overlap	bond_overlap	atom_Tanimoto	bond_Tanimoto
 CHEBI:90	21	23	17	17	0.773	0.708	0.654	0.567
 CHEBI:165	11	12	9	8	0.409	0.333	0.375	0.286
 CHEBI:598	10	9	6	5	0.273	0.208	0.231	0.179
 CHEBI:776	21	24	9	8	0.409	0.333	0.265	0.200
 ...

The command-line options support a few o the chemfp reader configurations.
For example, the following uses the 'ChEBI ID' tag of the records
in the gzip-ed SDF to get the record name.
 
 % python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID'
 Target_ID	nAtoms	nBonds	match_nAtoms	match_nBonds	atom_overlap	bond_overlap	atom_Tanimoto	bond_Tanimoto
 CHEBI:776	21	24	9	8	0.409	0.333	0.265	0.200
 CHEBI:1148	7	6	6	5	0.273	0.208	0.261	0.200
 CHEBI:1734	19	21	16	15	0.727	0.625	0.640	0.500
 CHEBI:1895	9	9	9	8	0.409	0.333	0.409	0.320
 ...

Finally, nearly all of the MCS parameters can be configured on the command-line.

This program was written by Andrew Dalke <da...@dalkescientific.com> .
"""

import sys
import argparse

from rdkit import Chem
from rdkit.Chem import rdFMCS

def get_chemfp_rdkit_toolkit(option):
    try:
        import chemfp
    except ImportError:
        sys.stderr.write("chemfp must be installed to use the %s argument.\n" % (option,))
        sys.stderr.write("Install it with:\n")
        sys.stderr.write("  python -m pip install chemfp -i https://chemfp.com/packages\n";)
        raise SystemExit(10)
    if not hasattr(chemfp, "__version_info__") or chemfp.__version_info__[0] < 3:
        sys.stderr.write("A newer version of chemfp must be installed to use the %s argument." % (option,))
        sys.stderr.write("Upgrade your version with:\n")
        sys.stderr.write("  python -m pip install chemfp -i https://chemfp.com/packages --upgrade\n")
        raise SystemExit(10)
    from chemfp import rdkit_toolkit
    return rdkit_toolkit

    

parser = argparse.ArgumentParser(
    description="Compute MCS similarity scores between a query and target structure or structures",
    epilog = EPILOG,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    )

# Query / target
parser.add_argument("--query", metavar="SMILES",
                        help="default query SMILES (default: naphthalene)")
target_group = parser.add_mutually_exclusive_group()
target_group.add_argument("--target", metavar="SMILES",
                    help="default target SMILES (default: phenol)")
target_group.add_argument("--targets", metavar="FILENAME",
                    help="read target structures from FILENAME; use '-' for stdin")

# MCS arguments
parser.add_argument("--maximize-atoms", action="store_true",
                        help="maximize the number of atoms (the default maximizes the number of atoms")
parser.add_argument("--match-valences", action="store_true",
                        help="atom matches must have the same valence")
parser.add_argument("--ring-matches-ring-only", action="store_true",
                        help="ring atoms must only match ring atoms")
parser.add_argument("--complete-rings-only", action="store_true",
                        help="only allow complete rings to match")
parser.add_argument("--match-chiral-tag", action="store_true",
                        help="atom matches must have the same chiral tag")
parser.add_argument("--atom-compare", choices=("Any", "AnyHeavyAtom", "Elements", "Isotopes"),
                        default="Elements",
                        help="method for comparing two atoms (default: Elements)")
parser.add_argument("--bond-compare", choices=("Any", "Order", "OrderExact"),
                        default="Order",
                        help="method for comparing two bonds (default: Order")
parser.add_argument("--ring-compare", choices=("IgnoreRingFusion", "PermissiveRingFusion", "StrictRingFusion"),
                        default="IgnoreRingFusion",
                        help="method for handling ring fusion (default: 'IgnoreRingFusion'")

_atom_comparisons = {
    "Any": rdFMCS.AtomCompare.CompareAny,
    "AnyHeavyAtom": rdFMCS.AtomCompare.CompareAnyHeavyAtom,
    "Elements": rdFMCS.AtomCompare.CompareElements,
    "Isotopes": rdFMCS.AtomCompare.CompareIsotopes,
    }
_bond_comparisons = {
    "Any": rdFMCS.BondCompare.CompareAny,
    "Order": rdFMCS.BondCompare.CompareOrder,
    "OrderExact": rdFMCS.BondCompare.CompareOrderExact,
    }
_ring_comparisons = {
    "IgnoreRingFusion": rdFMCS.RingCompare.IgnoreRingFusion,
    "PermissiveRingFusion": rdFMCS.RingCompare.PermissiveRingFusion,
    "StrictRingFusion": rdFMCS.RingCompare.StrictRingFusion,
    }

# Configure how to parse the query and target(s)
parser.add_argument("--query-format", metavar="FORMAT",
                    help="query format (default is 'smi' for SMILES)")
parser.add_argument("--target-format", metavar="FORMAT",
                    help="target format (default: based on filename extension, defaults to SMILES")
parser.add_argument("--id-tag", metavar="STR",
                        help="tag containing the identifier, use if --target is an SDF")
parser.add_argument("--has-header", action="store_true",
                        help="if targets is a SMILES file, the first line contains the header")
parser.add_argument("--delimiter", choices=("to-eol", "tab", "whitespace", "space"),
                        help="if targets is a SMILES file, the delimiter style (default: 'to-eol'")

# Configure the output
parser.add_argument("-o", "--output", metavar="FILENAME",
                        help="write output to FILENAME")
parser.add_argument("--out", choices=("flat", "tsv"),
                        help="output style (default is 'flat' for --target and 'tsv' for --targets)")

def main():
    args = parser.parse_args()

    mcs_kwargs = {
        "maximizeBonds": not args.maximize_atoms,
        "matchValences": args.match_valences,
        "ringMatchesRingOnly": args.ring_matches_ring_only,
        "completeRingsOnly": args.complete_rings_only,
        "matchChiralTag": args.match_chiral_tag,
        "atomCompare": _atom_comparisons[args.atom_compare],
        "bondCompare": _bond_comparisons[args.bond_compare],
        "ringCompare": _ring_comparisons[args.ring_compare],
        }
    
    if args.query is None:
        sys.stderr.write("No --query specified, using naphthalene as the default.\n")
        query = Chem.MolFromSmiles("c1ccc2ccccc2c1") # naphthalene is the default query
    else:
        if args.query_format is None:
            query = Chem.MolFromSmiles(args.query)
            if query is None:
                parser.error("Unable to parse --query as a SMILES string")
        else:
            # Use chemfp to process the --query
            T = get_chemfp_rdkit_toolkit("--query")
            try:
                fmt = T.get_input_format(args.query_format)
            except ValueError as err:
                parser.error("Cannot use --query-format %s: %s" % (
                    args.query_format, err))
            try:
                query = T.parse_molecule(args.query, fmt)
            except ValueError as err:
                parser.error("Cannot parse --query as a %s record: %s" % (
                    args.query_format, err))

    query_num_atoms = query.GetNumAtoms()
    query_num_bonds = query.GetNumBonds()

    headers = [
        "Target_ID", "nAtoms", "nBonds", "match_nAtoms", "match_nBonds",
        "atom_overlap", "bond_overlap", "atom_Tanimoto", "bond_Tanimoto"]
    field_formats = [
        "%s", "%d", "%d", "%d", "%d",
        "%.3f", "%.3f", "%.3f", "%.3f"]
    assert len(headers) == len(field_formats)
    
    output_format = args.out
    if args.targets is None:
        if output_format is None:
            output_format = "flat"
        targets_close = None
        
        # Use a single target
        if args.target is None:
            sys.stderr.write("No --target or --targets specified, using phenol as the default.\n")
            target = Chem.MolFromSmiles("c1ccccc1O") # phenol is the default target
            targets = [("phenol", target)]
            
        elif args.target_format is None:
            target = Chem.MolFromSmiles(args.target)
            if target is None:
                parser.error("Unable to parse --target as a SMILES string")
            targets = [("query", target)]
        else:
            # Use chemfp to process the --target
            T = get_chemfp_rdkit_toolkit("--target-format")
            try:
                fmt = T.get_input_format(args.target_format)
            except ValueError as err:
                parser.error("Cannot use --target-format %s: %s" % (
                    args.query_format, err))
            try:
                target = T.parse_molecule(args.target, fmt, id_tag=args.id_tag)
            except ValueError as err:
                parser.error("Cannot parse --target as a %s record: %s" % (
                    args.target_format, err))
            targets = [("query", target)]
    
    else:
        if output_format is None:
            output_format = "tsv"
        # Read from the named file
        # Use chemfp to process the --targets
        T = get_chemfp_rdkit_toolkit("--targets")
        try:
            fmt = T.get_input_format(args.target_format)
        except ValueError as err:
            parser.error("Cannot use --target-format %s: %s" % (
                args.query_format, err))
        if not fmt.supports_io:
            parser.error("chemfp does not support reading from --target-format %s files" % (
                args.target_format,))
        source = args.targets
        if source == "-":
            source = None
        try:
            targets = T.read_ids_and_molecules(
                source, format=args.target_format,
                id_tag=args.id_tag,
                reader_args={
                    "delimiter": args.delimiter,
                    "has_header": args.has_header,
                    },
                errors="report")
        except IOError as err:
            parser.error("Cannot open --targets file: %s" % (err,))
        targets_close = targets.close

    outfile_close = None

    try:
        # open the --output, or use stdout
        if args.output is None:
            outfile = sys.stdout
        else:
            try:
                outfile = open(args.output, "w")
            except IOError as err:
                parser.error("Cannot open --output file: %s" % (err,))
            outfile_close = outfile.close

        if output_format == "tsv":
            outfile.write("\t".join(headers) + "\n")
            output_format = "\t".join(field for field in field_formats) + "\n"
        elif output_format == "flat":
            output_format = "\n".join(header + ": " + field for (header, field) in zip(headers, field_formats)) + "\n"
        else:
            raise AssertionError(output_format)

        for target_idx, (target_id, target) in enumerate(targets, 1):
            if target_id is None:
                target_id = "MissingID-" + str(target_idx)

            # Actually do the MCS search
            match = rdFMCS.FindMCS([query, target], **mcs_kwargs)
            
            # Report the results
            if match is None:
                # no match
                fields = (
                    target_id,
                    target.GetNumAtoms(), target.GetNumBonds(),
                    0, 0,
                    0.0, 0.0,
                    0.0, 0.0)
            else:
                # describe the match
                target_num_atoms = target.GetNumAtoms()
                target_num_bonds = target.GetNumBonds()
                match_numAtoms = match.numAtoms
                match_numBonds = match.numBonds
                fields = (
                    target_id,
                    target_num_atoms, target_num_bonds,
                    match_numAtoms, match_numBonds,
                    match_numAtoms / query_num_atoms,
                    match_numBonds / query_num_bonds,
                    match_numAtoms / (query_num_atoms + target_num_atoms - match_numAtoms),
                    match_numBonds / (query_num_bonds + target_num_bonds - match_numBonds),
                    )
            outfile.write(output_format % fields)
        
    finally:
        if outfile_close is not None:
            outfile_close()
        if targets_close is not None:
            targets_close()
if __name__ == "__main__":
    main()

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Partial substructure match?

Reply via email to