I have updated my code: fixed bugs, cleaned things up, and improved performance and memory use. The biggest change is in path selection, when choosing the path used to partition all of the clusters. I now have a much stronger bias towards selecting short paths, rather than taking the first one I find. Short paths tend to do a better job of splitting the other groups as well.
I generated a set of SMARTS-based non-intersecting linear substructure keys for a set of PubChem molecules (CIDs between 9425001 and 9450000; 18,327 molecules in all). I used my "linear_hash.py" algorithm (attached) to find 256 partition paths. This took about 12 minutes and 500 MB of memory. I used the resulting paths in "evaluate_hash_fingerprint.py" (attached) to evaluate the effectiveness of the fingerprints using Greg's query and target data. The results are:

  First  16 bits filter 46% of the target set
  First  32 bits filter 65%
  First  64 bits filter 83%
  First  96 bits filter 86%
  First 128 bits filter 90%
  All   256 bits filter 90%

This compares very well to Greg's results, where his layered fingerprints filtered out 91% of the structures using 1024 bits.

The paths I used were based on target analysis. I think I can do better with query analysis. I want to generate all fragments for a target data set, then use a GA with a scoring function against a query data set in order to optimize screening effectiveness for different sized fingerprints. I don't have time to do that this week.

Here are the above-mentioned attachments. While I used OEChem to generate the patterns, the results are simple SMARTS patterns that can be used by any toolkit. They are in "evaluate_hash_fingerprint.py".
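To make the "filter %" numbers concrete: a fingerprint screen can reject a target whenever the query has a bit set that the target lacks. The following is a minimal, toolkit-free sketch of that arithmetic; the integer fingerprints are invented for illustration, not real SMARTS matches:

```python
# Minimal sketch of fingerprint screening. The fingerprint
# values below are made up; in practice each bit would come
# from a SMARTS substructure match.

def is_subset(target_fp, query_fp):
    # Every bit set in the query must also be set in the target,
    # otherwise the target cannot contain the query substructure.
    return target_fp & query_fp == query_fp

def filter_rate(query_fps, target_fps):
    # Average fraction of targets rejected, over all queries.
    total_unfiltered = 0
    for q in query_fps:
        total_unfiltered += sum(1 for t in target_fps if is_subset(t, q))
    avg_unfiltered = total_unfiltered / float(len(query_fps))
    return (len(target_fps) - avg_unfiltered) / len(target_fps)

targets = [0b1011, 0b0110, 0b1100, 0b0001]
queries = [0b0001, 0b0100]
rate = filter_rate(queries, targets)  # here, half of the targets are screened out
```

Using only the first N patterns corresponds to masking off the higher bits, which is why one pattern list can be evaluated at several fingerprint sizes.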
# Evaluate the filter ability of a set of SMARTS-based fingerprints.
#
# The evaluation uses a set of substructure queries to filter
# a target data set. The goal is to maximize the number of
# compounds which are rejected by the queries.
#
# The pattern definitions below come from running linear_hash.py
# against PubChem compounds 9425001 to 9450000, set to
# generate 256 paths.
#
# The query and target data sets come from Greg Landrum.
# Note: they contain duplicates, so the results aren't as
# precise as the numbers I give.
#
# I ran this code for different sized fingerprints, done by
# selecting the first N patterns. Results are:
#   First  16 bits filter 46% of the target set
#   First  32 bits filter 65%
#   First  64 bits filter 83%
#   First  96 bits filter 86%
#   First 128 bits filter 90%
#   All   256 bits filter 90%
# Looks like 128 bits is optimal.
#
# I also tried some extra cycle patterns which replace
# the last linear bits.
#    16 bits => 51%
#    32 bits => 68%
#    64 bits => 83%
#    96 bits => 86%
#   128 bits => 88%
#   256 bits => 90%
# Looks like the cycles I chose weren't all that helpful.
from openeye.oechem import *

NUM_BITS = 256
USE_CYCLE_PATHS = 0

# output from linear_hash.py
linear_hashes = """
0: largest=18327, split using C-C-N-c (dist=29)
1: largest=9192, split using C-N-C=O (dist=6)
2: largest=5178, split using C-C-C (dist=24)
3: largest=2676, split using C-C-c (dist=36)
4: largest=1885, split using C-C-N-C-c (dist=3)
5: largest=1227, split using C-C (dist=7)
6: largest=1152, split using C-C-O (dist=2)
7: largest=857, split using C-O (dist=12)
8: largest=739, split using C-C-C-N-C=O (dist=2)
9: largest=651, split using C-N-N-C=O (dist=19)
10: largest=625, split using C-c~c~c (dist=16)
11: largest=544, split using O-c~c-O (dist=9)
12: largest=514, split using C-C-N-C-C (dist=19)
13: largest=318, split using O-C-C-O (dist=9)
14: largest=317, split using C=C-C-N-C-C-N (dist=1)
15: largest=301, split using C-c~c~c~s~c-C (dist=6)
16: largest=268, split using F (dist=14)
17: largest=170, split using N-c (dist=15)
18: largest=167, split using Cl-c~c~c~c (dist=0)
19: largest=128, split using C-F (dist=1)
20: largest=110, split using C-c~c~c-O (dist=3)
21: largest=109, split using C-c~c~c~c~c (dist=12)
22: largest=105, split using O-c~c-O-C-C=O (dist=2)
23: largest=100, split using C-c~c~c~c~c-C (dist=2)
24: largest=89, split using F-c~c~c-N (dist=0)
25: largest=86, split using C-C=O (dist=2)
26: largest=86, split using C-c~c~c~c-N (dist=21)
27: largest=85, split using O=C-N-C-C-N-c (dist=2)
28: largest=84, split using N-C-C-c (dist=1)
29: largest=73, split using C-C-C-C-C (dist=0)
30: largest=67, split using C-O-c~c~c~c-C (dist=2)
31: largest=64, split using C-c~c-N (dist=12)
32: largest=63, split using N-C-C-C-C-N (dist=0)
33: largest=50, split using N-c~c-O (dist=13)
34: largest=47, split using C=C-c~c~c~c~c (dist=2)
35: largest=44, split using C-C-O-c~c~c-C (dist=2)
36: largest=42, split using n (dist=1)
37: largest=37, split using C-C-C-C (dist=1)
38: largest=36, split using C-C-C-C-C-C-C (dist=0)
39: largest=36, split using S (dist=6)
40: largest=34, split using C=C-c~c~c~c-O (dist=1)
41: largest=32, split using N-c~c~c~c-O (dist=8)
42: largest=31, split using O-c~c~c-C=O (dist=1)
43: largest=26, split using Cl-c~c-N (dist=1)
44: largest=24, split using C-C-c~c~c~c-N (dist=2)
45: largest=24, split using Br (dist=4)
46: largest=22, split using c-c (dist=1)
47: largest=22, split using O=C-c (dist=1)
48: largest=20, split using C-C-n (dist=0)
49: largest=19, split using C-c~c~c-N (dist=2)
50: largest=18, split using O=S (dist=0)
51: largest=18, split using F-c~c-N (dist=1)
52: largest=18, split using C-C-C=O (dist=1)
53: largest=16, split using n~c~n (dist=0)
54: largest=16, split using C-c~c~c~c-C (dist=0)
55: largest=16, split using I (dist=2)
56: largest=16, split using N=O (dist=2)
57: largest=16, split using O=C-N-C-c (dist=2)
58: largest=15, split using C-c~c~c~c-Cl (dist=0)
59: largest=14, split using O-c~c~c-O (dist=1)
60: largest=14, split using c~c~c~c~c~c~c (dist=1)
61: largest=14, split using C-c~c~c~c-F (dist=3)
62: largest=13, split using C-C-C-N-C-C-C (dist=3)
63: largest=12, split using s (dist=0)
64: largest=12, split using C-C-C-C-N (dist=0)
65: largest=12, split using C-N-C-N (dist=2)
66: largest=12, split using Cl-c~c~c-N (dist=2)
67: largest=10, split using n~n (dist=1)
68: largest=10, split using C-c~n (dist=1)
69: largest=10, split using O-c~c~c~c-O (dist=1)
70: largest=10, split using O-c~c-C-N-C=O (dist=1)
71: largest=10, split using C-O-c~c~c-C=C (dist=1)
72: largest=10, split using N-C-C-C-C=O (dist=3)
73: largest=9, split using C-c~c~c-S-N (dist=0)
74: largest=9, split using C-C-C-C-C=O (dist=3)
75: largest=8, split using Cl (dist=0)
76: largest=8, split using C-n (dist=0)
77: largest=8, split using C-O-c (dist=0)
78: largest=8, split using C-c~c-c (dist=0)
79: largest=8, split using C-C-O-C (dist=0)
80: largest=8, split using C-c~c-F (dist=0)
81: largest=8, split using C-c~c-C (dist=0)
82: largest=8, split using O=C-N-c (dist=0)
83: largest=8, split using Cl-c~c~c-Cl (dist=0)
84: largest=8, split using C-S-c~c-N (dist=0)
85: largest=8, split using Br-c~c~c-C (dist=0)
86: largest=8, split using F-c~c~c-F (dist=0)
87: largest=8, split using N-c~c~c-S (dist=0)
88: largest=8, split using C-N-C-N-C (dist=0)
89: largest=8, split using C-C-c~c~c-C (dist=0)
90: largest=8, split using C-C-C-C-C-C (dist=0)
91: largest=8, split using C#N (dist=2)
92: largest=8, split using C=C (dist=2)
93: largest=8, split using O-C-O (dist=2)
94: largest=8, split using c-C-c (dist=2)
95: largest=8, split using C (dist=4)
96: largest=8, split using O (dist=4)
97: largest=8, split using c (dist=4)
98: largest=8, split using N (dist=4)
99: largest=8, split using C-c (dist=4)
100: largest=8, split using C-N (dist=4)
101: largest=8, split using C-S (dist=4)
102: largest=8, split using N-S (dist=4)
103: largest=8, split using C=O (dist=4)
104: largest=8, split using c~c (dist=4)
105: largest=8, split using c~c~c (dist=4)
106: largest=8, split using C-S=O (dist=4)
107: largest=8, split using C-c~c (dist=4)
108: largest=8, split using C-N-S (dist=4)
109: largest=8, split using O=S=O (dist=4)
110: largest=8, split using N-C=O (dist=4)
111: largest=8, split using C-C-N (dist=4)
112: largest=8, split using N-S=O (dist=4)
113: largest=8, split using S-C-c (dist=4)
114: largest=8, split using C-S-N (dist=4)
115: largest=8, split using C-N-C (dist=4)
116: largest=8, split using C-C-C-N (dist=4)
117: largest=8, split using N-C-C-N (dist=4)
118: largest=8, split using C-C-N-S (dist=4)
119: largest=8, split using N-S-C-c (dist=4)
120: largest=8, split using O=S-C-c (dist=4)
121: largest=8, split using c~c~c~c (dist=4)
122: largest=8, split using C-C-N-C (dist=4)
123: largest=8, split using C-N-S=O (dist=4)
124: largest=8, split using N-C-C=O (dist=4)
125: largest=8, split using S-C-c~c (dist=4)
126: largest=8, split using C-N-S-C (dist=4)
127: largest=8, split using C-C-C-N-C (dist=4)
128: largest=8, split using C-C-N-S-C (dist=4)
129: largest=8, split using C-N-C-C=O (dist=4)
130: largest=8, split using O=S-C-c~c (dist=4)
131: largest=8, split using C-c~c~c~c (dist=4)
132: largest=8, split using c~c~c~c~c (dist=4)
133: largest=8, split using N-C-C-N-S (dist=4)
134: largest=8, split using C-C-C-C=O (dist=4)
135: largest=8, split using S-C-c~c~c (dist=4)
136: largest=8, split using N-S-C-c~c (dist=4)
137: largest=8, split using C-C-N-C=O (dist=4)
138: largest=8, split using C-N-S-C-c (dist=4)
139: largest=8, split using C-N-C-C-N (dist=4)
140: largest=8, split using C-C-N-S=O (dist=4)
141: largest=8, split using C-S-N-C-C-N (dist=4)
142: largest=8, split using C-C-C-N-C-C (dist=4)
143: largest=8, split using C-C-N-C-C-N (dist=4)
144: largest=8, split using c~c~c~c~c~c (dist=4)
145: largest=8, split using N-C-C-N-S=O (dist=4)
146: largest=8, split using C-C-C-C-N-C (dist=4)
147: largest=8, split using C-C-N-C-C=O (dist=4)
148: largest=8, split using C-N-C-C-N-C (dist=4)
149: largest=8, split using C-N-C-C-N-S (dist=4)
150: largest=8, split using O=S-C-c~c~c (dist=4)
151: largest=8, split using C-C-N-S-C-c (dist=4)
152: largest=8, split using C-C-C-C-C-N (dist=4)
153: largest=8, split using N-S-C-c~c~c (dist=4)
154: largest=8, split using C-N-S-C-c~c (dist=4)
155: largest=8, split using S-C-c~c~c~c (dist=4)
156: largest=8, split using C-C-C-N-C-C-N (dist=4)
157: largest=8, split using C-N-C-C-C-C-N (dist=4)
158: largest=8, split using C-C-C-C-N-C-C (dist=4)
159: largest=8, split using C-N-S-C-c~c~c (dist=4)
160: largest=8, split using C-C-C-C-C-C-N (dist=4)
161: largest=8, split using C-C-C-C-N-C=O (dist=4)
162: largest=8, split using N-C-C-N-C-C=O (dist=4)
163: largest=8, split using N-S-C-c~c~c~c (dist=4)
164: largest=8, split using C-C-N-S-C-c~c (dist=4)
165: largest=8, split using N-C-C-N-S-C-c (dist=4)
166: largest=8, split using C-C-N-C-C-N-S (dist=4)
167: largest=8, split using C-N-C-C-N-S=O (dist=4)
168: largest=8, split using O=S-C-c~c~c~c (dist=4)
169: largest=8, split using C-C-C-C-C-N-C (dist=4)
170: largest=8, split using C-N-C-C-N-S-C (dist=4)
171: largest=8, split using C-N-C-C-C-C=O (dist=4)
172: largest=8, split using C-C-N-C-C-N-C (dist=4)
173: largest=8, split using C-c~c~c~c~c~c (dist=4)
174: largest=8, split using C-C-C-N-C-C=O (dist=4)
175: largest=8, split using S-C-c~c~c~c~c (dist=4)
176: largest=8, split using N-C-C-N-C-C-N (dist=4)
177: largest=7, split using Cl-c~c-Cl (dist=0)
178: largest=6, split using C-c~c-O (dist=0)
179: largest=6, split using C-C-O-C-C (dist=0)
180: largest=6, split using o (dist=1)
181: largest=6, split using Br-c~c-N (dist=1)
182: largest=6, split using C-C-O-c (dist=1)
183: largest=6, split using C-c~c~n (dist=1)
184: largest=6, split using F-c~c-F (dist=1)
185: largest=6, split using Cl-c~c-O (dist=1)
186: largest=6, split using N-c~c-N (dist=1)
187: largest=6, split using C-C-C-n (dist=1)
188: largest=6, split using I-c~c-N (dist=1)
189: largest=6, split using C-c~c~c-Cl (dist=1)
190: largest=6, split using c-C-N-C-c (dist=1)
191: largest=6, split using N-c~c~c-O (dist=1)
192: largest=6, split using C-O-c~c-O-C (dist=1)
193: largest=6, split using C-c~c~c-C=C (dist=1)
194: largest=6, split using C-C-O-c~c-O (dist=1)
195: largest=6, split using Cl-c~c~c~c-N (dist=1)
196: largest=6, split using C-C-c~c~c~c~c (dist=1)
197: largest=6, split using C-C-N-C-N-C=O (dist=1)
198: largest=6, split using C-C-c~c~c-C-C (dist=1)
199: largest=6, split using C-N-c~c~c~c-C (dist=1)
200: largest=6, split using Cl-c (dist=3)
201: largest=6, split using F-c (dist=3)
202: largest=6, split using O-c (dist=3)
203: largest=6, split using c~s (dist=3)
204: largest=6, split using F-C-F (dist=3)
205: largest=6, split using c~c~s (dist=3)
206: largest=6, split using F-C-c (dist=3)
207: largest=6, split using N-c~c (dist=3)
208: largest=6, split using C-S-C (dist=3)
209: largest=6, split using O-c~c (dist=3)
210: largest=6, split using N-C-c (dist=3)
211: largest=6, split using N-C-N (dist=3)
212: largest=6, split using C-c~s (dist=3)
213: largest=6, split using C-C=C (dist=3)
214: largest=6, split using C=C-c (dist=3)
215: largest=6, split using C-N-c (dist=3)
216: largest=6, split using C-C-S (dist=3)
217: largest=6, split using O-C-c (dist=3)
218: largest=6, split using O-C=O (dist=3)
219: largest=6, split using c~s~c (dist=3)
220: largest=6, split using F-c~c (dist=3)
221: largest=6, split using Cl-c~c (dist=3)
222: largest=6, split using C-N-C-c (dist=3)
223: largest=6, split using C-C=C-c (dist=3)
224: largest=6, split using C=C-C-c (dist=3)
225: largest=6, split using C-C-S=O (dist=3)
226: largest=6, split using C=C-c~c (dist=3)
227: largest=6, split using c~c~s~c (dist=3)
228: largest=6, split using O=C-c~c (dist=3)
229: largest=6, split using C-C-c~c (dist=3)
230: largest=6, split using O-C-c~c (dist=3)
231: largest=6, split using O-C-C=O (dist=3)
232: largest=6, split using c-C-O-c (dist=3)
233: largest=6, split using Cl-c~c~c (dist=3)
234: largest=6, split using C-N-c~c (dist=3)
235: largest=6, split using C=C-C=O (dist=3)
236: largest=6, split using C-C-c~s (dist=3)
237: largest=6, split using C-C-S-C (dist=3)
238: largest=6, split using F-c~c~c (dist=3)
239: largest=6, split using C-C-C-c (dist=3)
240: largest=6, split using N-C-C-S (dist=3)
241: largest=6, split using C-C-C-O (dist=3)
242: largest=6, split using C-C-C-S (dist=3)
243: largest=6, split using N-C-c~s (dist=3)
244: largest=6, split using N-C-c~c (dist=3)
245: largest=6, split using F-c~c-O (dist=3)
246: largest=6, split using N-c~c~c (dist=3)
247: largest=6, split using C-O-c~c (dist=3)
248: largest=6, split using O-c~c~c (dist=3)
249: largest=6, split using c~c~c~s (dist=3)
250: largest=6, split using F-C-c~c (dist=3)
251: largest=6, split using C-c~s~c (dist=3)
252: largest=6, split using C-C-C-C-S (dist=3)
253: largest=6, split using C-C-c~s~c (dist=3)
254: largest=6, split using C-C-C-c~s (dist=3)
255: largest=6, split using O=C-C=C-c (dist=3)
""".splitlines()

# remove blank lines
linear_hashes = [x for x in linear_hashes if x]
assert len(linear_hashes) == 256, len(linear_hashes)

### Use this to change the number of patterns in the fp
assert 0 < NUM_BITS <= len(linear_hashes)
linear_hashes = linear_hashes[:NUM_BITS]

smarts_patterns = [line.split()[4] for line in linear_hashes]

if USE_CYCLE_PATHS:
    # Feel free to tweak these as you see fit.
    # Simple cycles are relatively easy to detect
    # as a side-effect of doing linear hashing.
    extra_patterns = [
        "*1**1",
        "*1***1",
        "*1****1",
        "*1*****1",
        "*1******1",
        "*1*******1",
        ]
    n = len(smarts_patterns)
    smarts_patterns[-len(extra_patterns):] = extra_patterns
    assert n == len(smarts_patterns)

# list of (bit field, single_match_function)
bit_patterns = []
for i, smarts in enumerate(smarts_patterns):
    pat = OESubSearch(smarts)
    assert pat.IsValid(), smarts
    bit_patterns.append((1<<i, pat.SingleMatch))

# Convert the molecule into an integer fingerprint. A bit
# is set if the corresponding single_match matches.
# This ends up being a lot faster than the other
# approaches I tried.
def to_fp(mol):
    n = 0
    for bit, single_match in bit_patterns:
        if single_match(mol):
            n |= bit
    return n

def load_smiles(filename):
    for line in open(filename):
        smiles = line.split()[0]
        mol = OEGraphMol()
        OEParseSmiles(mol, smiles)
        #OESuppressHydrogens(mol)
        OEAssignAromaticFlags(mol)
        yield to_fp(mol), OECreateCanSmiString(mol), mol

def is_subset(target, query):
    return target & query == query

queries = list(load_smiles("queries.txt"))
targets = list(load_smiles("mols.1000.txt"))

# Test the entire query set against the targets.
# Keep track of how many compounds are not
# filtered by each query.
sum_n = 0
for query_fp, query_smiles, query_mol in queries:
    n = 0
    for target_fp, target_smiles, target_mol in targets:
        if is_subset(target_fp, query_fp):
            n += 1
    print query_smiles, "does not filter", n
    sum_n += n

# Get the average number of compounds which
# are not filtered, then compute the
# screening effectiveness.
unfiltered = sum_n/float(len(queries))
print "average number not filtered", unfiltered
print "filter rate:", (len(targets)-unfiltered)/(len(targets)+0.0)
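The to_fp() loop above doesn't care where its match functions come from. As a toolkit-free illustration of the same bit-assembly idea, this sketch swaps OESubSearch.SingleMatch for plain substring tests on a SMILES string. The "patterns" are invented stand-ins, and the substring test is deliberately crude ("C" also hits the "C" inside "Cl") — exactly the kind of false match a real SMARTS matcher avoids:

```python
# Sketch of the to_fp() bit-assembly using substring predicates
# in place of real substructure matches.
patterns = ["C", "N", "O", "Cl"]
bit_patterns = [(1 << i, p) for i, p in enumerate(patterns)]

def to_fp(smiles):
    n = 0
    for bit, substring in bit_patterns:
        if substring in smiles:
            n |= bit  # set bit i when pattern i "matches"
    return n
```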
# Find a set of substructure keys which are most divisive in
# classifying a set of compounds.
#
# 1. Generate non-intersecting linear hash substructure keys
#    for a set of compounds.
# 2. Those keys define the initial cluster.
# 3. Use a greedy algorithm to:
#      - Find the largest groups
#      - Of those, find the paths which are closest to splitting
#        their group in half
#      - Of those, pick one of the paths which is shortest.
#      - This is the path used to partition the clusters
# 4. Partition each existing cluster into two new clusters: one containing
#    molecules with the path, the other containing those without the path
# 5. Remove molecules which have no remaining paths to consider, and remove
#    clusters which are too small
# 6. Go to #3 until there are enough split paths.

from openeye.oechem import *
from collections import defaultdict

# Helper definitions for naming atoms and bonds
# when generating the SMARTS for each path
element_symbols = (
    "* H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca " +
    "Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr " +
    "Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd " +
    "Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg " +
    "Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm " +
    "Md No Lr Rf Db Sg Bh Hs Mt Ds Rg").split()

organic_subset = set("* B C N O P S F Cl Br I".split())
aromatic = set("C N O P S As Se".split())

# Keyed by (atomic number, is_aromatic)
atom_symbols = {}

# These must be inside of brackets. Non-aromatic cases.
for eleno, symbol in enumerate(element_symbols):
    atom_symbols[(eleno, 0)] = "[%s]" % symbol

# Except these, which don't need the brackets. Non-aromatic cases.
for symbol in organic_subset:
    eleno = element_symbols.index(symbol)
    atom_symbols[(eleno, 0)] = symbol

# These can be aromatic
for symbol in aromatic:
    eleno = element_symbols.index(symbol)
    if symbol in organic_subset:
        atom_symbols[(eleno, 1)] = symbol.lower()
    else:
        atom_symbols[(eleno, 1)] = "[%s]" % symbol.lower()

# SMARTS pattern for each atom
def atom_pattern(atom, _atom_symbols=atom_symbols):
    return _atom_symbols[atom.GetAtomicNum(), atom.IsAromatic()]

bond_symbols = {
    (1, 0): "-",
    (1, 1): "~",
    (2, 0): "=",
    (2, 1): "~",
    (3, 0): "#",
    (4, 0): "$",
}

# SMARTS pattern for each bond type (1, 2, 3, 4 and "~" for aromatic)
def bond_pattern(bond, _bond_symbols=bond_symbols):
    return _bond_symbols[bond.GetOrder(), bond.IsAromatic()]

## This is a direct way to generate linear paths of up to length 7.
# It's a bit tedious, but easy to understand and doesn't involve
# any recursion.
def generate_linear_paths(mol):
    ATOM_PATTERN = atom_pattern
    BOND_PATTERN = bond_pattern
    # This assumes that GetIdx returns the atom/bond index.
    # That assumption breaks if the molecule was ever edited.
    num_atoms = mol.NumAtoms()
    num_bonds = mol.NumBonds()
    visited_atoms = [0] * num_atoms

    # Don't include hydrogens, except a few special cases
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 1:
            if atom.GetDegree() == 0:
                # hydrogen atom
                pass
            elif atom.GetDegree() >= 2:
                # more than one bond
                pass
            elif atom.GetHvyDegree() == 0:
                # bonded to hydrogen
                pass
            else:
                # Ignore this hydrogen
                visited_atoms[atom.GetIdx()] = 1

    # path must have 7 atoms, because that's how I wrote it
    path_N = 7
    terms = [None] * (path_N*2-1)
    for atom1 in mol.GetAtoms():
        if visited_atoms[atom1.GetIdx()]:
            # Only possible if this is a hydrogen
            continue
        visited_atoms[atom1.GetIdx()] = 1
        # linear fragment of length 1
        terms[0] = ATOM_PATTERN(atom1)
        yield terms[:1]
        for bond1 in atom1.GetBonds():
            atom2 = bond1.GetNbr(atom1)
            if visited_atoms[atom2.GetIdx()]:
                # Only possible if this is a hydrogen
                continue
            visited_atoms[atom2.GetIdx()] = 1
            # path of length 2
            terms[1] = BOND_PATTERN(bond1)
            terms[2] = ATOM_PATTERN(atom2)
            yield terms[:3]
            for bond2 in atom2.GetBonds():
                atom3 = bond2.GetNbr(atom2)
                if visited_atoms[atom3.GetIdx()]:
                    continue
                visited_atoms[atom3.GetIdx()] = 1
                # path of length 3
                terms[3] = BOND_PATTERN(bond2)
                terms[4] = ATOM_PATTERN(atom3)
                yield terms[:5]
                for bond3 in atom3.GetBonds():
                    atom4 = bond3.GetNbr(atom3)
                    if visited_atoms[atom4.GetIdx()]:
                        # cycle with 3 atoms
                        continue
                    visited_atoms[atom4.GetIdx()] = 1
                    # path of length 4
                    terms[5] = BOND_PATTERN(bond3)
                    terms[6] = ATOM_PATTERN(atom4)
                    yield terms[:7]
                    for bond4 in atom4.GetBonds():
                        atom5 = bond4.GetNbr(atom4)
                        if visited_atoms[atom5.GetIdx()]:
                            # cycle with 4 atoms
                            continue
                        visited_atoms[atom5.GetIdx()] = 1
                        terms[7] = BOND_PATTERN(bond4)
                        terms[8] = ATOM_PATTERN(atom5)
                        yield terms[:9]
                        for bond5 in atom5.GetBonds():
                            atom6 = bond5.GetNbr(atom5)
                            if visited_atoms[atom6.GetIdx()]:
                                # cycle with 5 atoms
                                continue
                            #visited_atoms[atom6.GetIdx()] = 1
                            terms[9] = BOND_PATTERN(bond5)
                            terms[10] = ATOM_PATTERN(atom6)
                            yield terms[:11]
                            #continue
                            for bond6 in atom6.GetBonds():
                                atom7 = bond6.GetNbr(atom6)
                                if visited_atoms[atom7.GetIdx()]:
                                    # cycle with 6 atoms
                                    continue
                                #visited_atoms[atom7.GetIdx()] = 1
                                terms[11] = BOND_PATTERN(bond6)
                                terms[12] = ATOM_PATTERN(atom7)
                                yield terms[:13]
                                #visited_atoms[atom7.GetIdx()] = 0
                            #visited_atoms[atom6.GetIdx()] = 0
                        visited_atoms[atom5.GetIdx()] = 0
                    visited_atoms[atom4.GetIdx()] = 0
                visited_atoms[atom3.GetIdx()] = 0
            visited_atoms[atom2.GetIdx()] = 0
        visited_atoms[atom1.GetIdx()] = 0

# The linear paths above are not unique.
# Remove duplicates, including duplicates
# caused by symmetry. I do a simple path
# canonicalization by creating both the
# forward and reverse SMARTS and using the
# one which is alphabetically smallest.
def generate_unique_linear_smarts(mol):
    unique = set()
    for path in generate_linear_paths(mol):
        s1 = "".join(path)
        s2 = "".join(path[::-1])
        s = min(s1, s2)
        if s in unique:
            continue
        unique.add(s)
        yield s

# The NCI and Greg's data sets contain duplicate structures.
# Remove them.
def read_unique_mols(infile, limit=None):
    if limit is None:
        limit = -1
    else:
        assert limit >= 0
    mol = OEGraphMol()
    unique_smiles = set()
    for i, line in enumerate(infile):
        if i == limit:
            break
        if i % 100 == 0:
            print "Processing compound", i
        smiles = line.split()[0]
        # This is likely not canonical. Convert to canonical
        # form to test for duplicate structures.
        mol.Clear()
        OEParseSmiles(mol, smiles)
        OEAssignAromaticFlags(mol)
        smiles2 = OECreateCanSmiString(mol)
        if smiles2 in unique_smiles:
            continue
        unique_smiles.add(smiles2)
        yield smiles2, mol

# path_to_smiless is a table mapping:
#    SMARTS path -> set of SMILES containing the path
# smiles_to_paths is a table mapping:
#    SMILES -> set of all paths contained in that SMILES
class Cluster(object):
    def __init__(self, path_to_smiless, smiles_to_paths):
        self.path_to_smiless = path_to_smiless
        self.smiles_to_paths = smiles_to_paths
    def __len__(self):
        # The number of compounds in the cluster
        return len(self.smiles_to_paths)

def make_initial_cluster(mol_input):
    # This saves a few percent of memory by
    # sharing the string for common paths.
    # I see about 10% savings.
    intern_table = {}
    smiles_to_paths = defaultdict(set)
    path_to_smiless = defaultdict(set)
    for smiles, mol in mol_input:
        for path in generate_unique_linear_smarts(mol):
            path = intern_table.setdefault(path, path)
            smiles_to_paths[smiles].add(path)
            path_to_smiless[path].add(smiles)
    return Cluster(path_to_smiless, smiles_to_paths)

# I could compute this once when I generate the path
# but I didn't know I needed it until the code was
# almost complete.
def num_bonds_in_path(s):
    return sum(c in "-~=#$" for c in s)

# Find num_bits SMARTS partitionings for the clusters.
# This uses a greedy algorithm.
def partition(clusters, num_bits, max_cluster_size=2):
    while num_bits > 0:
        if not clusters:
            raise TypeError("no compounds left")
        num_bits -= 1

        # Find the clusters with the largest number of compounds in them
        #print "Sizes", map(len, clusters)
        largest_clusters = None
        largest_n = -1
        for cluster in clusters:
            n = len(cluster)
            if n > largest_n:
                largest_n = n
                largest_clusters = [cluster]
            elif n == largest_n:
                largest_clusters.append(cluster)

        # Of these, find the paths which are closest to
        # partitioning the cluster in half.
        average = largest_n//2
        best_paths = None
        best_dist = largest_n+1
        for cluster in largest_clusters:
            for path, smiless in cluster.path_to_smiless.items():
                dist = abs(len(smiless)-average)
                if dist < best_dist:
                    best_dist = dist
                    best_paths = set([path])
                elif dist == best_dist:
                    best_paths.add(path)

        # Of these SMARTS paths, pick the shortest one.
        # This is more likely to be able to split other clusters.
        split_path = min(best_paths, key=num_bonds_in_path)
        yield (largest_n, split_path, best_dist)
        #print "largest %s; split on %s (dist=%s)" % (largest_n, split_path, best_dist)

        # Repartition all of the clusters based on the new
        # split path. Each cluster can be split into two parts:
        #    - molecules which have the path
        #    - molecules which don't have the path
        # Of course, if the cluster is too small, then just ignore it.
        new_clusters = []
        for cluster in clusters:
            # Split each cluster into one or two other clusters. All structures
            # with this path go into one, those without go into the other.
            smiless = cluster.path_to_smiless.get(split_path, set())
            cluster_with_split_path = Cluster(defaultdict(set), defaultdict(set))
            cluster_without_split_path = Cluster(defaultdict(set), defaultdict(set))
            for smiles, paths in cluster.smiles_to_paths.items():
                if split_path in paths:
                    # If the compound has only one path left, then
                    # it can't be split any more; ignore it.
                    if len(paths) > 1:
                        cluster_with_split_path.smiles_to_paths[smiles] = paths.copy()
                        cluster_with_split_path.smiles_to_paths[smiles].remove(split_path)
                        for path in cluster_with_split_path.smiles_to_paths[smiles]:
                            cluster_with_split_path.path_to_smiless[path].add(smiles)
                else:
                    # split_path not in the molecule's path set
                    cluster_without_split_path.smiles_to_paths[smiles] = paths.copy()
                    for path in cluster_without_split_path.smiles_to_paths[smiles]:
                        cluster_without_split_path.path_to_smiless[path].add(smiles)

            for new_cluster in (cluster_with_split_path, cluster_without_split_path):
                # It's useless worrying about clusters that are too small
                if len(new_cluster) > max_cluster_size:
                    new_clusters.append(new_cluster)
        clusters = new_clusters

# There are quite a few paths which only exist in a small
# number of molecules. If that count is small enough then
# they will never be a good split key, so I might as well
# remove the possibility now.
def remove_rare_paths(cluster, min_membership_size):
    for path, smiless in cluster.path_to_smiless.items():
        if len(smiless) < min_membership_size:
            #print "Remove path", path
            for smiles in smiless:
                cluster.smiles_to_paths[smiles].remove(path)
            del cluster.path_to_smiless[path]
    for smiles, paths in cluster.smiles_to_paths.items():
        # This molecule only has very rare paths?
        # What is it, [Dy] or something exotic like that?
        if len(paths) == 0:
            del cluster.smiles_to_paths[smiles]

def main():
    #filename = "mols.1000.txt"
    filename = "/Users/dalke/databases/nci_09425001_09450000.smi"
    molecule_limit = None
    max_cluster_size = 4
    num_cutoffs = 256
    mol_input = read_unique_mols(open(filename, "U"), limit=molecule_limit)
    initial_cluster = make_initial_cluster(mol_input)
    remove_rare_paths(initial_cluster, max_cluster_size)
    clusters = [initial_cluster]
    for i, (n, path, dist) in enumerate(
            partition(clusters, num_cutoffs, max_cluster_size)):
        print "%03s: largest=%s, split using %s (dist=%s)" % (i, n, path, dist)

## TEST FUNCTIONS
def test_symbols():
    assert atom_symbols[(34, 0)] == "[Se]"
    assert atom_symbols[(34, 1)] == "[se]"
    assert atom_symbols[(6, 0)] == "C"
    assert atom_symbols[(6, 1)] == "c"
    assert atom_symbols[(5, 0)] == "B"
    assert (5, 1) not in atom_symbols

def test_generate_unique_linear_paths():
    mol = OEGraphMol()
    OEParseSmiles(mol, "C=NOSPBFBr")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = ("C C=N C=N-O C=N-O-S C=N-O-S-P B-P-S-O-N=C C=N-O-S-P-B-F " +
                      "N N-O N-O-S N-O-S-P B-P-S-O-N F-B-P-S-O-N Br-F-B-P-S-O-N " +
                      "O O-S O-S-P B-P-S-O F-B-P-S-O Br-F-B-P-S-O " +
                      "S P-S B-P-S F-B-P-S Br-F-B-P-S " +
                      "P B-P F-B-P Br-F-B-P " +
                      "B B-F B-F-Br " +
                      "F Br-F " +
                      "Br").split()
    for path in paths:
        assert path in expected_paths, path
    assert len(paths) == len(expected_paths)

    mol.Clear()
    OEParseSmiles(mol, "c1ccccccc1")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = "c c~c c~c~c c~c~c~c c~c~c~c~c c~c~c~c~c~c c~c~c~c~c~c~c".split()
    for path in paths:
        assert path in expected_paths
    assert len(paths) == len(expected_paths)

    mol.Clear()
    OEParseSmiles(mol, "c1ccncc1")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = ("c c~c c~c~c c~c~c~c c~c~c~c~c " +
                      "n c~n c~c~n c~c~c~n c~c~c~c~n c~c~c~c~c~n " +
                      "c~n~c c~c~n~c c~c~c~n~c c~c~c~c~n~c " +
                      "c~c~n~c~c c~c~c~n~c~c").split()
    for path in paths:
        assert path in expected_paths, path
    for path in expected_paths:
        assert path in paths, path
    assert len(paths) == len(expected_paths)

    # Check that I don't loop
    for smiles, max_bond_count in (
            ("C1CC1", 2), ("C1CCC1", 3), ("C1CCCC1", 4),
            ("C1CCCCC1", 5), ("C1CCCCCC1", 6),
            ("CC1CC1", 3), ("CC1CCC1", 4), ("CC1CCCC1", 5), ("CC1CCCCC1", 6),
            ("CCC1CC1", 4), ("CCC1CCC1", 5), ("CCC1CCCC1", 6),
            ("CCCC1CC1", 5), ("CCCC1CCC1", 6),
            ("CCCCC1CC1", 6)):
        mol.Clear()
        assert OEParseSmiles(mol, smiles)
        for path in generate_unique_linear_smarts(mol):
            assert path.count("-") <= max_bond_count

    # Special symbols
    mol.Clear()
    OEParseSmiles(mol, "[C]#[238U]")
    assert set(generate_unique_linear_smarts(mol)) == set("C [U] C#[U]".split())
    mol.Clear()
    OEParseSmiles(mol, "[Cl]$[Pt]")
    assert set(generate_unique_linear_smarts(mol)) == set("Cl [Pt] Cl$[Pt]".split())

    # Hydrogens
    mol.Clear()
    OEParseSmiles(mol, "C([2H])([2H])([2H])[H]")
    assert mol.NumAtoms() == 5
    assert set(generate_unique_linear_smarts(mol)) == set(["C"])
    mol.Clear()
    OEParseSmiles(mol, "[H]")
    assert set(generate_unique_linear_smarts(mol)) == set(["[H]"])
    mol.Clear()
    OEParseSmiles(mol, "C[H]C")
    assert set(generate_unique_linear_smarts(mol)) == set("C [H] C-[H] C-[H]-C".split())
    mol.Clear()
    OEParseSmiles(mol, "[H][H]")
    assert set(generate_unique_linear_smarts(mol)) == set("[H] [H]-[H]".split())

def test():
    test_symbols()
    test_generate_unique_linear_paths()

if __name__ == "__main__":
    test()
    main()
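To see the greedy selection rule of linear_hash.py in isolation, here is a small self-contained sketch run on a hand-made path table (the molecule ids and paths are invented): pick the path whose membership is closest to half the cluster, break ties with the fewest bonds, then split the cluster on it.

```python
from collections import defaultdict

# Hand-made table: molecule id -> set of linear paths it contains.
smiles_to_paths = {
    "m1": set(["C", "C-C", "C-O"]),
    "m2": set(["C", "C-C"]),
    "m3": set(["C", "C-N-O"]),
    "m4": set(["C"]),
}

def pick_split_path(smiles_to_paths):
    # Invert to path -> molecules containing it.
    path_to_smiless = defaultdict(set)
    for smiles, paths in smiles_to_paths.items():
        for path in paths:
            path_to_smiless[path].add(smiles)
    half = len(smiles_to_paths) // 2
    def badness(path):
        # Distance from a 50/50 split, then bond count (path length).
        return (abs(len(path_to_smiless[path]) - half), path.count("-"))
    return min(path_to_smiless, key=badness)

split_path = pick_split_path(smiles_to_paths)  # "C-C": in 2 of the 4 molecules
with_path = set(s for s, p in smiles_to_paths.items() if split_path in p)
without_path = set(smiles_to_paths) - with_path
```

The real script repeats this on whichever cluster is currently largest, dropping clusters and molecules that become too small to split further.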
Andrew
da...@dalkescientific.com