Re: [Rdkit-discuss] one flavor of MCS

Andrew Dalke Thu, 13 Dec 2012 02:27:35 -0800

On Dec 13, 2012, at 9:18 AM, [email protected] wrote:
> Regarding the issues you mentioned
   ...
> 
> - non-canonical SMARTS
> - duplicates are not filtered out


I think I figured out a way around that via some post-processing.


>> Or do you mean the number of molecules which contain that structure,
>> in which case "CC" exists in 3 of the structures.

Okay. You want a (say) 15% threshold, and the threshold MCS isn't
yet ported over to RDKit. It is in that hacked module I sent the
other day.

> Following up on the example from the above section:
> "
> CCNCc1ccccc1
> c1ccc(cc1)CN
> CCNCc1ccccc1
> c1cnc[nH]
> "

BTW, that last SMILES isn't complete.

> => I would be more than happy to finally have this output:
> "
> newflavorOfMCS   frequency
> ##########################
> c1ccc(cc1)CN     3
> c1cnc[nH]        1
> "

That script is attached. Here's an example of it in use:

% python paul.py --min-num-bonds 7 paul.smi
[11:13:31] SMILES Parse Error: unclosed ring for input: c1cnc[nH]
Could not process SMILES c1cnc[nH] on line 4
frequency #atoms #bonds SMILES SMARTS
3 7 7 Cc1ccccc1 [#6]:1:[#6]:[#6]:[#6](-[#6]):[#6]:[#6]:1
2 8 7 ccc(CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
3 8 8 NCc1ccccc1 [#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
3 8 7 ccccccCN [#6](:[#6]:[#6]):[#6]:[#6]:[#6]-[#6]-[#7]
2 8 7 cccc(CNC)c [#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]
3 8 7 ccccc(CN)c [#6](:[#6]:[#6]:[#6](-[#6]-[#7]):[#6]):[#6]
2 8 7 cccccCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 8 7 ccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]
2 8 7 ccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]
3 8 7 cccc(CN)cc [#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 ccccc(CNC)c [#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]:[#6]):[#6]
2 9 8 ccc(CNCC)cc [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
2 9 8 cccc(CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 cccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]):[#6]
2 9 9 CNCc1ccccc1 [#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 9 8 ccccccCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]
2 9 8 cccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 10 9 ccccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]:[#6]
2 10 9 cccc(cc)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 10 10 CCNCc1ccccc1 [#6]-[#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 10 9 ccccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]


You can see that 'ccccc(CNC)c' and 'ccc(CNCC)cc' are rotations
around the same ring. That's a limitation of the current approach.

A non-hacked solution could hook into the information about
"complete-ring-only". That would give you structures more like
what a chemist would expect, though not the scaffolds that Christos
and Peter suggested.

import sys
import re
from rdkit import Chem
import hacked_fmcs

import argparse
parser = argparse.ArgumentParser("Find fragments")
parser.add_argument("filename",
                    help="SMILES filename")
parser.add_argument("--threshold", "-t", type=float, default=0.5,
                    help="minimum match threshold (default: 0.5)")
parser.add_argument("--min-num-atoms", type=int, default=2,
                    help="minimum number of atom for the match (default: 2)")
parser.add_argument("--min-num-bonds", type=int, default=1,
                    help="minimum number of bonds for the match (default: 1)")

args = parser.parse_args()
if args.min_num_atoms < 2:
    parser.error("--min-num-atoms must be at least 2")
if args.min_num_bonds < 1:
    parser.error("--min-num-bonds must be at least 2")
if args.threshold <= 0.0:
    parser.error("--threshold must be positive")

table = Chem.GetPeriodicTable()
_atomic_num_to_symbol = dict((str(i), table.GetElementSymbol(i)) for i in range(105))
atom_pat = re.compile("\[#(\d+)\]")
def atomic_num_to_symbol(m):
    return _atomic_num_to_symbol[m.group(1)]
    
def smarts_to_unique_smiles_fragment(smarts):
    fragment_smiles = atom_pat.sub(atomic_num_to_symbol, smarts)
    mol = Chem.MolFromSmiles(fragment_smiles, sanitize=False)
    # Correct the aromaticity information
    for bond in mol.GetBonds():
        if bond.GetIsAromatic():
            bond.GetBeginAtom().SetIsAromatic(1)
            bond.GetEndAtom().SetIsAromatic(1)
    return Chem.MolToSmiles(mol)

# Read in the input file
mols = []
for lineno, line in enumerate(open(args.filename)):
    fields = line.split()
    if not fields:
        sys.stderr.write("Missing SMILES on line %d" % (lineno+1,))
        continue
    smiles = fields[0]
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        sys.stderr.write("Could not process SMILES %s on line %d\n" % (smiles, lineno+1))
        continue
    mols.append(mol)

mcs_result = hacked_fmcs.fmcs(mols,
                              threshold=args.threshold,
                              min_num_atoms=args.min_num_atoms,
                              # fmcs doesn't implement min_num_bonds; use post-processing
                              maximize="all-matches")

# Convert to unique fragment SMILES
matches = []
unique_fragment_smiles = set()
for smarts, is_match in mcs_result.matches.items():
    if not is_match:  # remember, this is hacked code, and a bit ugly!
        continue

    fragment_smiles = smarts_to_unique_smiles_fragment(smarts)
    if fragment_smiles in unique_fragment_smiles:
        continue
    unique_fragment_smiles.add(fragment_smiles)
    pattern = Chem.MolFromSmarts(smarts)
    if pattern.GetNumBonds() < args.min_num_bonds:
        continue
    matches.append( (pattern, smarts, fragment_smiles) )

def order_by_number_of_atoms(term):
    return term[0].GetNumAtoms()
    
matches.sort(key=order_by_number_of_atoms)


print "frequency #atoms #bonds SMILES SMARTS"
for pattern, smarts, fragment_smiles in matches:
    count = sum(1 for mol in mols if mol.HasSubstructMatch(pattern))
    print count, pattern.GetNumAtoms(), pattern.GetNumBonds(), fragment_smiles, smarts





                                Andrew
                                [email protected]

------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] one flavor of MCS

Reply via email to