On Dec 13, 2012, at 9:18 AM, [email protected] wrote:
> Regarding the issues you mentioned
...
>
> - non-canonical SMARTS
> - duplicates are not filtered out
I think I figured out a way around that via some post-processing.
>> Or do you mean the number of molecules which contain that structure,
>> in which case "CC" exists in 3 of the structures.
Okay. You want a (say) 15% threshold, and the threshold MCS isn't
yet ported over to RDKit. It is in that hacked module I sent the
other day.
> Following up on the example from the above section:
> "
> CCNCc1ccccc1
> c1ccc(cc1)CN
> CCNCc1ccccc1
> c1cnc[nH]
> "
BTW, that last SMILES isn't complete.
> => I would be more than happy to finally have this output:
> "
> newflavorOfMCS frequency
> ##########################
> c1ccc(cc1)CN 3
> c1cnc[nH] 1
> "
That script is attached. Here's an example of it in use:
% python paul.py --min-num-bonds 7 paul.smi
[11:13:31] SMILES Parse Error: unclosed ring for input: c1cnc[nH]
Could not process SMILES c1cnc[nH] on line 4
frequency #atoms #bonds SMILES SMARTS
3 7 7 Cc1ccccc1 [#6]:1:[#6]:[#6]:[#6](-[#6]):[#6]:[#6]:1
2 8 7 ccc(CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
3 8 8 NCc1ccccc1 [#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
3 8 7 ccccccCN [#6](:[#6]:[#6]):[#6]:[#6]:[#6]-[#6]-[#7]
2 8 7 cccc(CNC)c [#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]
3 8 7 ccccc(CN)c [#6](:[#6]:[#6]:[#6](-[#6]-[#7]):[#6]):[#6]
2 8 7 cccccCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 8 7 ccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]
2 8 7 ccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]
3 8 7 cccc(CN)cc [#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 ccccc(CNC)c [#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]:[#6]):[#6]
2 9 8 ccc(CNCC)cc [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
2 9 8 cccc(CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 cccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]):[#6]
2 9 9 CNCc1ccccc1 [#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 9 8 ccccccCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]
2 9 8 cccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 10 9 ccccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]:[#6]
2 10 9 cccc(cc)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 10 10 CCNCc1ccccc1 [#6]-[#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 10 9 ccccccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]
You can see that 'ccccc(CNC)c' and 'ccc(CNCC)cc' are rotations
around the same ring. That's a limitation of the current approach.
A non-hacked solution could hook into the information about
"complete-ring-only". That would give you structures more like
what a chemist would expect, though not the scaffolds that Christos
and Peter suggested.
import sys
import re
from rdkit import Chem
import hacked_fmcs
import argparse
parser = argparse.ArgumentParser("Find fragments")
parser.add_argument("filename",
help="SMILES filename")
parser.add_argument("--threshold", "-t", type=float, default=0.5,
help="minimum match threshold (default: 0.5)")
parser.add_argument("--min-num-atoms", type=int, default=2,
help="minimum number of atom for the match (default: 2)")
parser.add_argument("--min-num-bonds", type=int, default=1,
help="minimum number of bonds for the match (default: 1)")
args = parser.parse_args()
if args.min_num_atoms < 2:
parser.error("--min-num-atoms must be at least 2")
if args.min_num_bonds < 1:
parser.error("--min-num-bonds must be at least 2")
if args.threshold <= 0.0:
parser.error("--threshold must be positive")
table = Chem.GetPeriodicTable()
_atomic_num_to_symbol = dict((str(i), table.GetElementSymbol(i)) for i in range(105))
atom_pat = re.compile("\[#(\d+)\]")
def atomic_num_to_symbol(m):
return _atomic_num_to_symbol[m.group(1)]
def smarts_to_unique_smiles_fragment(smarts):
fragment_smiles = atom_pat.sub(atomic_num_to_symbol, smarts)
mol = Chem.MolFromSmiles(fragment_smiles, sanitize=False)
# Correct the aromaticity information
for bond in mol.GetBonds():
if bond.GetIsAromatic():
bond.GetBeginAtom().SetIsAromatic(1)
bond.GetEndAtom().SetIsAromatic(1)
return Chem.MolToSmiles(mol)
# Read in the input file
mols = []
for lineno, line in enumerate(open(args.filename)):
fields = line.split()
if not fields:
sys.stderr.write("Missing SMILES on line %d" % (lineno+1,))
continue
smiles = fields[0]
mol = Chem.MolFromSmiles(smiles)
if mol is None:
sys.stderr.write("Could not process SMILES %s on line %d\n" % (smiles, lineno+1))
continue
mols.append(mol)
mcs_result = hacked_fmcs.fmcs(mols,
threshold=args.threshold,
min_num_atoms=args.min_num_atoms,
# fmcs doesn't implement min_num_bonds; use post-processing
maximize="all-matches")
# Convert to unique fragment SMILES
matches = []
unique_fragment_smiles = set()
for smarts, is_match in mcs_result.matches.items():
if not is_match: # remember, this is hacked code, and a bit ugly!
continue
fragment_smiles = smarts_to_unique_smiles_fragment(smarts)
if fragment_smiles in unique_fragment_smiles:
continue
unique_fragment_smiles.add(fragment_smiles)
pattern = Chem.MolFromSmarts(smarts)
if pattern.GetNumBonds() < args.min_num_bonds:
continue
matches.append( (pattern, smarts, fragment_smiles) )
def order_by_number_of_atoms(term):
return term[0].GetNumAtoms()
matches.sort(key=order_by_number_of_atoms)
print "frequency #atoms #bonds SMILES SMARTS"
for pattern, smarts, fragment_smiles in matches:
count = sum(1 for mol in mols if mol.HasSubstructMatch(pattern))
print count, pattern.GetNumAtoms(), pattern.GetNumBonds(), fragment_smiles, smarts
Andrew
[email protected]
------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss