I have updated my code: fixed bugs, cleaned things up, and
improved performance and memory use. The biggest change was to
path selection, that is, how I choose the path used to partition
all of the clusters. I now have a much stronger bias towards
selecting short paths, rather than taking the first one I find.
This tends to do a better job of splitting the other groups as well.
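
In code, the tie-break boils down to this line from partition()
in the attached linear_hash.py, where num_bonds_in_path() counts
the number of bond symbols in a SMARTS path:

  split_path = min(best_paths, key=num_bonds_in_path)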

I generated a set of SMARTS-based non-intersecting linear
substructure keys for a set of PubChem molecules (CIDs
between 9425001 and 9450000, with 18,327 molecules in it).
I used my "linear_hash.py" algorithm (attached) to find
256 partition paths. This took about 12 minutes and 500 MB.

I used the resulting paths in "evaluate_hash_fingerprint.py"
(attached) to evaluate the effectiveness of the fingerprints
using Greg's query and target data. The results are:


  First  16 bits filter 46% of the target set
  First  32 bits filter 65%
  First  64 bits filter 83%
  First  96 bits filter 86%
  First 128 bits filter 90%
  All   256 bits filter 90%

This compares very well to Greg's results, where his
layered fingerprints filtered out 91% of the structures
using 1024 bits.

The paths I used were based on target analysis. I think
I can do better with query analysis. I want to generate
all fragments for a target data set, then use a GA with
a scoring function based on a query data set, in order
to optimize screening effectiveness for different
fingerprint sizes.

I don't have any time to do that this week.

Here are the above-mentioned attachments. While I
used OEChem to generate the patterns, the results
are simple SMARTS patterns that can be used by any
toolkit. They are in "evaluate_hash_fingerprint.py".
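
For example, given the smarts_patterns list from that script,
something like this should compute the same fingerprint with
RDKit (a sketch; I haven't actually run it):

  from rdkit import Chem

  # build the query patterns once, then test each one per molecule
  patterns = [Chem.MolFromSmarts(s) for s in smarts_patterns]
  mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, as an example
  fp = 0
  for i, pat in enumerate(patterns):
      if mol.HasSubstructMatch(pat):  # set bit i on a match
          fp |= 1 << i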

# Evaluate the filter ability of a set of SMARTS-based fingerprints.
#
# The evaluation uses a set of substructure queries to filter
# a target data set. The goal is to maximize the number of
# compounds which are rejected by the queries.

# The pattern definitions below come from running linear_hash.py
# against PubChem compounds 9425001 to 9450000, set to
# generate 256 paths.
#
# The query and target data sets come from Greg Landrum.
# Note: they contain duplicates so the results aren't as
# precise as the numbers I give.

# I ran this code for different fingerprint sizes, by
# selecting the first N patterns. Results are:

# First  16 bits filter 46% of the target set
# First  32 bits filter 65%
# First  64 bits filter 83%
# First  96 bits filter 86%
# First 128 bits filter 90%
# All   256 bits filter 90%

# Looks like 128 bits is optimal.

# I also tried some extra cycle patterns which replace
# the last linear bits.
#  16 bits => 51%
#  32 bits => 68%
#  64 bits => 83%
#  96 bits => 86%
# 128 bits => 88%
# 256 bits => 90%

# Looks like the cycles I chose weren't all that helpful.

from openeye.oechem import *

NUM_BITS = 256
USE_CYCLE_PATHS = 0

# output from linear_hash.py

linear_hashes = """
  0: largest=18327, split using C-C-N-c (dist=29)
  1: largest=9192, split using C-N-C=O (dist=6)
  2: largest=5178, split using C-C-C (dist=24)
  3: largest=2676, split using C-C-c (dist=36)
  4: largest=1885, split using C-C-N-C-c (dist=3)
  5: largest=1227, split using C-C (dist=7)
  6: largest=1152, split using C-C-O (dist=2)
  7: largest=857, split using C-O (dist=12)
  8: largest=739, split using C-C-C-N-C=O (dist=2)
  9: largest=651, split using C-N-N-C=O (dist=19)
 10: largest=625, split using C-c~c~c (dist=16)
 11: largest=544, split using O-c~c-O (dist=9)
 12: largest=514, split using C-C-N-C-C (dist=19)
 13: largest=318, split using O-C-C-O (dist=9)
 14: largest=317, split using C=C-C-N-C-C-N (dist=1)
 15: largest=301, split using C-c~c~c~s~c-C (dist=6)
 16: largest=268, split using F (dist=14)
 17: largest=170, split using N-c (dist=15)
 18: largest=167, split using Cl-c~c~c~c (dist=0)
 19: largest=128, split using C-F (dist=1)
 20: largest=110, split using C-c~c~c-O (dist=3)
 21: largest=109, split using C-c~c~c~c~c (dist=12)
 22: largest=105, split using O-c~c-O-C-C=O (dist=2)
 23: largest=100, split using C-c~c~c~c~c-C (dist=2)
 24: largest=89, split using F-c~c~c-N (dist=0)
 25: largest=86, split using C-C=O (dist=2)
 26: largest=86, split using C-c~c~c~c-N (dist=21)
 27: largest=85, split using O=C-N-C-C-N-c (dist=2)
 28: largest=84, split using N-C-C-c (dist=1)
 29: largest=73, split using C-C-C-C-C (dist=0)
 30: largest=67, split using C-O-c~c~c~c-C (dist=2)
 31: largest=64, split using C-c~c-N (dist=12)
 32: largest=63, split using N-C-C-C-C-N (dist=0)
 33: largest=50, split using N-c~c-O (dist=13)
 34: largest=47, split using C=C-c~c~c~c~c (dist=2)
 35: largest=44, split using C-C-O-c~c~c-C (dist=2)
 36: largest=42, split using n (dist=1)
 37: largest=37, split using C-C-C-C (dist=1)
 38: largest=36, split using C-C-C-C-C-C-C (dist=0)
 39: largest=36, split using S (dist=6)
 40: largest=34, split using C=C-c~c~c~c-O (dist=1)
 41: largest=32, split using N-c~c~c~c-O (dist=8)
 42: largest=31, split using O-c~c~c-C=O (dist=1)
 43: largest=26, split using Cl-c~c-N (dist=1)
 44: largest=24, split using C-C-c~c~c~c-N (dist=2)
 45: largest=24, split using Br (dist=4)
 46: largest=22, split using c-c (dist=1)
 47: largest=22, split using O=C-c (dist=1)
 48: largest=20, split using C-C-n (dist=0)
 49: largest=19, split using C-c~c~c-N (dist=2)
 50: largest=18, split using O=S (dist=0)
 51: largest=18, split using F-c~c-N (dist=1)
 52: largest=18, split using C-C-C=O (dist=1)
 53: largest=16, split using n~c~n (dist=0)
 54: largest=16, split using C-c~c~c~c-C (dist=0)
 55: largest=16, split using I (dist=2)
 56: largest=16, split using N=O (dist=2)
 57: largest=16, split using O=C-N-C-c (dist=2)
 58: largest=15, split using C-c~c~c~c-Cl (dist=0)
 59: largest=14, split using O-c~c~c-O (dist=1)
 60: largest=14, split using c~c~c~c~c~c~c (dist=1)
 61: largest=14, split using C-c~c~c~c-F (dist=3)
 62: largest=13, split using C-C-C-N-C-C-C (dist=3)
 63: largest=12, split using s (dist=0)
 64: largest=12, split using C-C-C-C-N (dist=0)
 65: largest=12, split using C-N-C-N (dist=2)
 66: largest=12, split using Cl-c~c~c-N (dist=2)
 67: largest=10, split using n~n (dist=1)
 68: largest=10, split using C-c~n (dist=1)
 69: largest=10, split using O-c~c~c~c-O (dist=1)
 70: largest=10, split using O-c~c-C-N-C=O (dist=1)
 71: largest=10, split using C-O-c~c~c-C=C (dist=1)
 72: largest=10, split using N-C-C-C-C=O (dist=3)
 73: largest=9, split using C-c~c~c-S-N (dist=0)
 74: largest=9, split using C-C-C-C-C=O (dist=3)
 75: largest=8, split using Cl (dist=0)
 76: largest=8, split using C-n (dist=0)
 77: largest=8, split using C-O-c (dist=0)
 78: largest=8, split using C-c~c-c (dist=0)
 79: largest=8, split using C-C-O-C (dist=0)
 80: largest=8, split using C-c~c-F (dist=0)
 81: largest=8, split using C-c~c-C (dist=0)
 82: largest=8, split using O=C-N-c (dist=0)
 83: largest=8, split using Cl-c~c~c-Cl (dist=0)
 84: largest=8, split using C-S-c~c-N (dist=0)
 85: largest=8, split using Br-c~c~c-C (dist=0)
 86: largest=8, split using F-c~c~c-F (dist=0)
 87: largest=8, split using N-c~c~c-S (dist=0)
 88: largest=8, split using C-N-C-N-C (dist=0)
 89: largest=8, split using C-C-c~c~c-C (dist=0)
 90: largest=8, split using C-C-C-C-C-C (dist=0)
 91: largest=8, split using C#N (dist=2)
 92: largest=8, split using C=C (dist=2)
 93: largest=8, split using O-C-O (dist=2)
 94: largest=8, split using c-C-c (dist=2)
 95: largest=8, split using C (dist=4)
 96: largest=8, split using O (dist=4)
 97: largest=8, split using c (dist=4)
 98: largest=8, split using N (dist=4)
 99: largest=8, split using C-c (dist=4)
100: largest=8, split using C-N (dist=4)
101: largest=8, split using C-S (dist=4)
102: largest=8, split using N-S (dist=4)
103: largest=8, split using C=O (dist=4)
104: largest=8, split using c~c (dist=4)
105: largest=8, split using c~c~c (dist=4)
106: largest=8, split using C-S=O (dist=4)
107: largest=8, split using C-c~c (dist=4)
108: largest=8, split using C-N-S (dist=4)
109: largest=8, split using O=S=O (dist=4)
110: largest=8, split using N-C=O (dist=4)
111: largest=8, split using C-C-N (dist=4)
112: largest=8, split using N-S=O (dist=4)
113: largest=8, split using S-C-c (dist=4)
114: largest=8, split using C-S-N (dist=4)
115: largest=8, split using C-N-C (dist=4)
116: largest=8, split using C-C-C-N (dist=4)
117: largest=8, split using N-C-C-N (dist=4)
118: largest=8, split using C-C-N-S (dist=4)
119: largest=8, split using N-S-C-c (dist=4)
120: largest=8, split using O=S-C-c (dist=4)
121: largest=8, split using c~c~c~c (dist=4)
122: largest=8, split using C-C-N-C (dist=4)
123: largest=8, split using C-N-S=O (dist=4)
124: largest=8, split using N-C-C=O (dist=4)
125: largest=8, split using S-C-c~c (dist=4)
126: largest=8, split using C-N-S-C (dist=4)
127: largest=8, split using C-C-C-N-C (dist=4)
128: largest=8, split using C-C-N-S-C (dist=4)
129: largest=8, split using C-N-C-C=O (dist=4)
130: largest=8, split using O=S-C-c~c (dist=4)
131: largest=8, split using C-c~c~c~c (dist=4)
132: largest=8, split using c~c~c~c~c (dist=4)
133: largest=8, split using N-C-C-N-S (dist=4)
134: largest=8, split using C-C-C-C=O (dist=4)
135: largest=8, split using S-C-c~c~c (dist=4)
136: largest=8, split using N-S-C-c~c (dist=4)
137: largest=8, split using C-C-N-C=O (dist=4)
138: largest=8, split using C-N-S-C-c (dist=4)
139: largest=8, split using C-N-C-C-N (dist=4)
140: largest=8, split using C-C-N-S=O (dist=4)
141: largest=8, split using C-S-N-C-C-N (dist=4)
142: largest=8, split using C-C-C-N-C-C (dist=4)
143: largest=8, split using C-C-N-C-C-N (dist=4)
144: largest=8, split using c~c~c~c~c~c (dist=4)
145: largest=8, split using N-C-C-N-S=O (dist=4)
146: largest=8, split using C-C-C-C-N-C (dist=4)
147: largest=8, split using C-C-N-C-C=O (dist=4)
148: largest=8, split using C-N-C-C-N-C (dist=4)
149: largest=8, split using C-N-C-C-N-S (dist=4)
150: largest=8, split using O=S-C-c~c~c (dist=4)
151: largest=8, split using C-C-N-S-C-c (dist=4)
152: largest=8, split using C-C-C-C-C-N (dist=4)
153: largest=8, split using N-S-C-c~c~c (dist=4)
154: largest=8, split using C-N-S-C-c~c (dist=4)
155: largest=8, split using S-C-c~c~c~c (dist=4)
156: largest=8, split using C-C-C-N-C-C-N (dist=4)
157: largest=8, split using C-N-C-C-C-C-N (dist=4)
158: largest=8, split using C-C-C-C-N-C-C (dist=4)
159: largest=8, split using C-N-S-C-c~c~c (dist=4)
160: largest=8, split using C-C-C-C-C-C-N (dist=4)
161: largest=8, split using C-C-C-C-N-C=O (dist=4)
162: largest=8, split using N-C-C-N-C-C=O (dist=4)
163: largest=8, split using N-S-C-c~c~c~c (dist=4)
164: largest=8, split using C-C-N-S-C-c~c (dist=4)
165: largest=8, split using N-C-C-N-S-C-c (dist=4)
166: largest=8, split using C-C-N-C-C-N-S (dist=4)
167: largest=8, split using C-N-C-C-N-S=O (dist=4)
168: largest=8, split using O=S-C-c~c~c~c (dist=4)
169: largest=8, split using C-C-C-C-C-N-C (dist=4)
170: largest=8, split using C-N-C-C-N-S-C (dist=4)
171: largest=8, split using C-N-C-C-C-C=O (dist=4)
172: largest=8, split using C-C-N-C-C-N-C (dist=4)
173: largest=8, split using C-c~c~c~c~c~c (dist=4)
174: largest=8, split using C-C-C-N-C-C=O (dist=4)
175: largest=8, split using S-C-c~c~c~c~c (dist=4)
176: largest=8, split using N-C-C-N-C-C-N (dist=4)
177: largest=7, split using Cl-c~c-Cl (dist=0)
178: largest=6, split using C-c~c-O (dist=0)
179: largest=6, split using C-C-O-C-C (dist=0)
180: largest=6, split using o (dist=1)
181: largest=6, split using Br-c~c-N (dist=1)
182: largest=6, split using C-C-O-c (dist=1)
183: largest=6, split using C-c~c~n (dist=1)
184: largest=6, split using F-c~c-F (dist=1)
185: largest=6, split using Cl-c~c-O (dist=1)
186: largest=6, split using N-c~c-N (dist=1)
187: largest=6, split using C-C-C-n (dist=1)
188: largest=6, split using I-c~c-N (dist=1)
189: largest=6, split using C-c~c~c-Cl (dist=1)
190: largest=6, split using c-C-N-C-c (dist=1)
191: largest=6, split using N-c~c~c-O (dist=1)
192: largest=6, split using C-O-c~c-O-C (dist=1)
193: largest=6, split using C-c~c~c-C=C (dist=1)
194: largest=6, split using C-C-O-c~c-O (dist=1)
195: largest=6, split using Cl-c~c~c~c-N (dist=1)
196: largest=6, split using C-C-c~c~c~c~c (dist=1)
197: largest=6, split using C-C-N-C-N-C=O (dist=1)
198: largest=6, split using C-C-c~c~c-C-C (dist=1)
199: largest=6, split using C-N-c~c~c~c-C (dist=1)
200: largest=6, split using Cl-c (dist=3)
201: largest=6, split using F-c (dist=3)
202: largest=6, split using O-c (dist=3)
203: largest=6, split using c~s (dist=3)
204: largest=6, split using F-C-F (dist=3)
205: largest=6, split using c~c~s (dist=3)
206: largest=6, split using F-C-c (dist=3)
207: largest=6, split using N-c~c (dist=3)
208: largest=6, split using C-S-C (dist=3)
209: largest=6, split using O-c~c (dist=3)
210: largest=6, split using N-C-c (dist=3)
211: largest=6, split using N-C-N (dist=3)
212: largest=6, split using C-c~s (dist=3)
213: largest=6, split using C-C=C (dist=3)
214: largest=6, split using C=C-c (dist=3)
215: largest=6, split using C-N-c (dist=3)
216: largest=6, split using C-C-S (dist=3)
217: largest=6, split using O-C-c (dist=3)
218: largest=6, split using O-C=O (dist=3)
219: largest=6, split using c~s~c (dist=3)
220: largest=6, split using F-c~c (dist=3)
221: largest=6, split using Cl-c~c (dist=3)
222: largest=6, split using C-N-C-c (dist=3)
223: largest=6, split using C-C=C-c (dist=3)
224: largest=6, split using C=C-C-c (dist=3)
225: largest=6, split using C-C-S=O (dist=3)
226: largest=6, split using C=C-c~c (dist=3)
227: largest=6, split using c~c~s~c (dist=3)
228: largest=6, split using O=C-c~c (dist=3)
229: largest=6, split using C-C-c~c (dist=3)
230: largest=6, split using O-C-c~c (dist=3)
231: largest=6, split using O-C-C=O (dist=3)
232: largest=6, split using c-C-O-c (dist=3)
233: largest=6, split using Cl-c~c~c (dist=3)
234: largest=6, split using C-N-c~c (dist=3)
235: largest=6, split using C=C-C=O (dist=3)
236: largest=6, split using C-C-c~s (dist=3)
237: largest=6, split using C-C-S-C (dist=3)
238: largest=6, split using F-c~c~c (dist=3)
239: largest=6, split using C-C-C-c (dist=3)
240: largest=6, split using N-C-C-S (dist=3)
241: largest=6, split using C-C-C-O (dist=3)
242: largest=6, split using C-C-C-S (dist=3)
243: largest=6, split using N-C-c~s (dist=3)
244: largest=6, split using N-C-c~c (dist=3)
245: largest=6, split using F-c~c-O (dist=3)
246: largest=6, split using N-c~c~c (dist=3)
247: largest=6, split using C-O-c~c (dist=3)
248: largest=6, split using O-c~c~c (dist=3)
249: largest=6, split using c~c~c~s (dist=3)
250: largest=6, split using F-C-c~c (dist=3)
251: largest=6, split using C-c~s~c (dist=3)
252: largest=6, split using C-C-C-C-S (dist=3)
253: largest=6, split using C-C-c~s~c (dist=3)
254: largest=6, split using C-C-C-c~s (dist=3)
255: largest=6, split using O=C-C=C-c (dist=3)
""".splitlines()
# remove blank lines
linear_hashes = [x for x in linear_hashes if x]

assert len(linear_hashes) == 256, len(linear_hashes)

### Use this to change the number of patterns in the fp
assert 0 < NUM_BITS <= len(linear_hashes)
linear_hashes = linear_hashes[:NUM_BITS]

smarts_patterns = [line.split()[4] for line in linear_hashes]

if USE_CYCLE_PATHS:
    # Feel free to tweak these as you see fit.
    # Simple cycles are relatively easy to detect
    # as a side-effect of doing linear hashing.
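    # These patterns match plain rings of 3 through 8 atoms.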
    extra_patterns = [
        "*1**1",
        "*1***1",
        "*1****1",
        "*1*****1",
        "*1******1",
        "*1*******1",
        ]
    
    n = len(smarts_patterns)
    smarts_patterns[-len(extra_patterns):] = extra_patterns
    assert n == len(smarts_patterns)
    
            
# list of (bit field, single_match_function)
bit_patterns = []
for i, smarts in enumerate(smarts_patterns):
    pat = OESubSearch(smarts)
    assert pat.IsValid(), smarts
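    # Save the bound SingleMatch method so the inner loop can call it
    # without an attribute lookup for each molecule.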
    bit_patterns.append((1<<i, pat.SingleMatch))

# Convert the matches into an integer. A bit
# is set if the corresponding single_match matches.
# This ends up being a lot faster than the other
# approaches I tried.
def to_fp(mol):
    n = 0
    for bit, single_match in bit_patterns:
        if single_match(mol):
            n |= bit
    return n
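
# For example, if only patterns 0 and 2 match a molecule then
# to_fp() returns (1<<0) | (1<<2) == 5.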

def load_smiles(filename):
    for line in open(filename):
        smiles = line.split()[0]
        mol = OEGraphMol()
        OEParseSmiles(mol, smiles)
        #OESuppressHydrogens(mol)
        OEAssignAromaticFlags(mol)
        yield to_fp(mol), OECreateCanSmiString(mol), mol

def is_subset(target, query):
    return target & query == query
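# For example, a query fingerprint of 0b0101 passes a target of
# 0b1101 (0b1101 & 0b0101 == 0b0101), while a target of 0b1001
# is screened out (0b1001 & 0b0101 == 0b0001 != 0b0101).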


queries = list(load_smiles("queries.txt"))
targets = list(load_smiles("mols.1000.txt"))

# Test the entire query set against the targets.
# Keep track of how many compounds are not
# filtered by each query.
sum_n = 0
for query_fp, query_smiles, query_mol in queries:
    n = 0
    for target_fp, target_smiles, target_mol in targets:
        if is_subset(target_fp, query_fp):
            n += 1
    print query_smiles, "does not filter", n
    sum_n += n

# Get the average number of compounds which
# are not filtered, then generate the
# screening effectiveness.
unfiltered = sum_n/float(len(queries))
print "average number not filtered", unfiltered
print "filter rate:", (len(targets)-unfiltered)/(len(targets)+0.0)

# Find a set of substructure keys which are most divisive in
# classifying a set of compounds.

# 1. Generate non-intersecting linear hash substructure keys for a set of compounds.
# 2. Those keys define the initial cluster.
# 3. Use a greedy algorithm to:
#    - Find the largest groups
#    - Of those, find the paths which come closest to splitting the group in half
#    - Of those, pick one of the shortest paths.
#    - This is the path used to partition the clusters
# 4. Partition each existing cluster into two new clusters: one containing
#      molecules with the path, the other containing those without the path
# 5. Remove molecules which have no remaining paths to consider, and remove
#      clusters which are too small
# 6. Go to #3 until there are enough split paths.

from openeye.oechem import *
from collections import defaultdict

# Helper definitions for naming atoms and bonds
# when generating the SMARTS for each path

element_symbols = (
    "* H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca " +
    "Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr " +
    "Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd " +
    "Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg " +
    "Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm " +
    "Md No Lr Rf Db Sg Bh Hs Mt Ds Rg").split()
organic_subset  = set("* B C N O P S F Cl Br I".split())
aromatic = set("C N O P S As Se".split())

# Keyed by (atomic number, is_aromatic)
atom_symbols = {}

# These must be inside of brackets. Non-aromatic cases.
for eleno, symbol in enumerate(element_symbols):
    atom_symbols[(eleno, 0)] = "[%s]" % symbol

# Except these, which don't need the brackets. Non-aromatic cases
for symbol in organic_subset:
    eleno = element_symbols.index(symbol)
    atom_symbols[(eleno, 0)] = symbol

# These can be aromatic
for symbol in aromatic:
    eleno = element_symbols.index(symbol)
    if symbol in organic_subset:
        atom_symbols[(eleno, 1)] = symbol.lower()
    else:
        atom_symbols[(eleno, 1)] = "[%s]" % symbol.lower()



# SMARTS pattern for each atom
def atom_pattern(atom, _atom_symbols=atom_symbols):
    return _atom_symbols[atom.GetAtomicNum(), atom.IsAromatic()]

bond_symbols = {
    (1, 0): "-",
    (1, 1): "~",
    (2, 0): "=",
    (2, 1): "~",
    (3, 0): "#",
    (4, 0): "$",
}

# SMARTS pattern for each bond type (1, 2, 3, 4 and "~" for aromatic)
def bond_pattern(bond, _bond_symbols=bond_symbols):
    return _bond_symbols[bond.GetOrder(), bond.IsAromatic()]

## This is a direct way to generate linear paths of up to length 7

# It's a bit tedious, but easy to understand and doesn't involve
# any recursion.

def generate_linear_paths(mol):
    ATOM_PATTERN = atom_pattern
    BOND_PATTERN = bond_pattern
    
    # This assumes that GetIdx returns the atom/bond index.
    # That assumption breaks if the molecule was ever edited
    num_atoms = mol.NumAtoms()
    num_bonds = mol.NumBonds()
    visited_atoms = [0] * num_atoms

    # Don't include hydrogens, except in a few special cases
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 1:
            if atom.GetDegree() == 0:  # isolated hydrogen atom
                pass
            elif atom.GetDegree() >= 2: # bridging hydrogen (more than one bond)
                pass
            elif atom.GetHvyDegree() == 0: # bonded only to another hydrogen
                pass
            else:
                # Ignore this hydrogen
                visited_atoms[atom.GetIdx()] = 1

    # paths can have at most 7 atoms, because that's how I wrote it
    path_N = 7
    terms = [None] * (path_N*2-1)

    for atom1 in mol.GetAtoms():
        if visited_atoms[atom1.GetIdx()]:
            # Only possible if this is a hydrogen
            continue
        visited_atoms[atom1.GetIdx()] = 1
        # linear fragment of length 1
        terms[0] = ATOM_PATTERN(atom1)
        yield terms[:1]

        for bond1 in atom1.GetBonds():
            atom2 = bond1.GetNbr(atom1)
            if visited_atoms[atom2.GetIdx()]:
                # Only possible if this is a hydrogen
                continue
            visited_atoms[atom2.GetIdx()] = 1

            # path of length 2
            terms[1] = BOND_PATTERN(bond1)
            terms[2] = ATOM_PATTERN(atom2)
            yield terms[:3]

            for bond2 in atom2.GetBonds():
                atom3 = bond2.GetNbr(atom2)
                if visited_atoms[atom3.GetIdx()]:
                    continue
                visited_atoms[atom3.GetIdx()] = 1

                # path of length 3
                terms[3] = BOND_PATTERN(bond2)
                terms[4] = ATOM_PATTERN(atom3)
                yield terms[:5]

                for bond3 in atom3.GetBonds():
                    atom4 = bond3.GetNbr(atom3)
                    if visited_atoms[atom4.GetIdx()]:
                        # cycle with 3 atoms
                        continue
                    visited_atoms[atom4.GetIdx()] = 1

                    # path of length 4
                    terms[5] = BOND_PATTERN(bond3)
                    terms[6] = ATOM_PATTERN(atom4)

                    yield terms[:7]

                    for bond4 in atom4.GetBonds():
                        atom5 = bond4.GetNbr(atom4)
                        if visited_atoms[atom5.GetIdx()]:
                            # cycle with 4 atoms
                            continue
                        visited_atoms[atom5.GetIdx()] = 1

                        terms[7] = BOND_PATTERN(bond4)
                        terms[8] = ATOM_PATTERN(atom5)

                        yield terms[:9]

                        for bond5 in atom5.GetBonds():
                            atom6 = bond5.GetNbr(atom5)
                            if visited_atoms[atom6.GetIdx()]:
                                # cycle with 5 atoms
                                continue
                            #visited_atoms[atom6.GetIdx()] = 1

                            terms[9] = BOND_PATTERN(bond5)
                            terms[10] = ATOM_PATTERN(atom6)

                            yield terms[:11]
                            #continue

                            for bond6 in atom6.GetBonds():
                                atom7 = bond6.GetNbr(atom6)
                                if visited_atoms[atom7.GetIdx()]:
                                    # cycle with 6 atoms
                                    continue
                                #visited_atoms[atom7.GetIdx()] = 1

                                terms[11] = BOND_PATTERN(bond6)
                                terms[12] = ATOM_PATTERN(atom7)

                                yield terms[:13]

                                #visited_atoms[atom7.GetIdx()] = 0
                            #visited_atoms[atom6.GetIdx()] = 0
                        visited_atoms[atom5.GetIdx()] = 0
                    visited_atoms[atom4.GetIdx()] = 0
                visited_atoms[atom3.GetIdx()] = 0
            visited_atoms[atom2.GetIdx()] = 0
        visited_atoms[atom1.GetIdx()] = 0

# The linear paths above are not unique.
# Remove duplicates, including duplicates
# caused by symmetry. I do a simple path
# canonicalization by creating both the
# forward and reverse SMARTS and using the
# one which is alphabetically smallest.
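# For example, the path "O-C=N" reversed is "N=C-O", and min() of the
# two strings picks "N=C-O" as the canonical form.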
def generate_unique_linear_smarts(mol):
    unique = set()
    for path in generate_linear_paths(mol):
        s1 = "".join(path)
        s2 = "".join(path[::-1])
        s = min(s1, s2)
        if s in unique:
            continue
        unique.add(s)
        yield s

# The NCI and Greg's data sets contain duplicate structures.
# Remove them.
def read_unique_mols(infile, limit=None):
    if limit is None:
        limit = -1
    else:
        assert limit >= 0
    mol = OEGraphMol()
    unique_smiles = set()
    for i, line in enumerate(infile):
        if i == limit:
            break
        if i % 100 == 0:
            print "Processing compound", i
        smiles = line.split()[0]
        # This is likely not canonical
        # Convert to canonical form to
        # test for duplicate structures.
        mol.Clear()
        OEParseSmiles(mol, smiles)
        OEAssignAromaticFlags(mol)
        smiles2 = OECreateCanSmiString(mol)
        if smiles2 in unique_smiles:
            continue
        unique_smiles.add(smiles2)
        yield smiles2, mol


# path_to_smiless is a table mapping:
#   SMARTS path -> set of SMILES containing the path
# smiles_to_paths is a table mapping:
#   SMILES -> set of all paths contained in that SMILES
class Cluster(object):
    def __init__(self, path_to_smiless, smiles_to_paths):
        self.path_to_smiless = path_to_smiless
        self.smiles_to_paths = smiles_to_paths
    def __len__(self):
        # The number of compounds in the cluster
        return len(self.smiles_to_paths)
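
# For example, if the path "C-N" occurs in the SMILES "CN" and "CCN"
# then path_to_smiless["C-N"] == set(["CN", "CCN"]), and "C-N" is a
# member of both smiles_to_paths["CN"] and smiles_to_paths["CCN"].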

def make_initial_cluster(mol_input):
    # This saves a few percent of memory by
    # sharing the string for common paths.
    # I see about 10% savings.
    intern_table = {}
    smiles_to_paths = defaultdict(set)
    path_to_smiless = defaultdict(set)
    
    for smiles, mol in mol_input:
        for path in generate_unique_linear_smarts(mol):
            path = intern_table.setdefault(path, path)

            smiles_to_paths[smiles].add(path)
            path_to_smiless[path].add(smiles)
    return Cluster(path_to_smiless, smiles_to_paths)

# I could compute this once when I generate the path
# but I didn't know I needed it until the code was
# almost complete.
def num_bonds_in_path(s):
    return sum(c in "-~=#$" for c in s)
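# For example, num_bonds_in_path("C-c~c~c") == 3.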

# Find num_bit SMARTS partitionings for the clusters.
# This uses a greedy algorithm.
def partition(clusters, num_bits, max_cluster_size=2):
    while num_bits > 0:
        if not clusters:
            raise TypeError("no compounds left")
        num_bits -= 1

        # Find the cluster with the largest number of compounds in it
        #print "Sizes", map(len, clusters)
        largest_clusters = None
        largest_n = -1
        for cluster in clusters:
            n = len(cluster)
            if n > largest_n:
                largest_n = n
                largest_clusters = [cluster]
            elif n == largest_n:
                largest_clusters.append(cluster)
        
        # Of these, find the paths which are closest to
        # partitioning the cluster in half.
        average = largest_n//2

        best_paths = None
        best_dist = largest_n+1
        for cluster in largest_clusters:
            for path, smiless in cluster.path_to_smiless.items():
                dist = abs(len(smiless)-average)
                if dist < best_dist:
                    best_dist = dist
                    best_paths = set([path])
                elif dist == best_dist:
                    best_paths.add(path)

        # Of these SMARTS paths, pick the shortest one
        # This is more likely to be able to split other clusters.
        split_path = min(best_paths, key=num_bonds_in_path)

        yield (largest_n, split_path, best_dist)
        #print "largest %s; split on %s (dist=%s)" % (n, split_path, dist)
        
        # Repartition all of the clusters based on using the new
        # split path. Each cluster can be split into two parts:
        #   - molecules which have the path
        #   - molecules which don't have the path
        # Of course, if the cluster is too small, then just ignore it.

        new_clusters = []
        for cluster in clusters:
            # Split each cluster into one or two other clusters. All structures
            # with this path go into one, those without go into the other.
            cluster_with_split_path = Cluster(defaultdict(set), defaultdict(set))
            cluster_without_split_path = Cluster(defaultdict(set), defaultdict(set))

            for smiles, paths in cluster.smiles_to_paths.items():
                if split_path in paths:
                    # If the compound has only one path left, then
                    # it can't be split any more; ignore it.
                    if len(paths) > 1:
                        cluster_with_split_path.smiles_to_paths[smiles] = paths.copy()
                        cluster_with_split_path.smiles_to_paths[smiles].remove(split_path)
                        for path in cluster_with_split_path.smiles_to_paths[smiles]:
                            cluster_with_split_path.path_to_smiless[path].add(smiles)
                else:
                    # split_path not in the molecule's path set
                    cluster_without_split_path.smiles_to_paths[smiles] = paths.copy()
                    for path in cluster_without_split_path.smiles_to_paths[smiles]:
                        cluster_without_split_path.path_to_smiless[path].add(smiles)

            for new_cluster in (cluster_with_split_path, cluster_without_split_path):
                # It's useless worrying about clusters that are too small
                if len(new_cluster) > max_cluster_size:
                    new_clusters.append(new_cluster)

        clusters = new_clusters

# There are quite a few paths which only exist in a small
# number of molecules. If that count is small enough then
# they will never be a good split key, so I might as well
# remove the possibility now.
def remove_rare_paths(cluster, min_membership_size):
    for path, smiless in cluster.path_to_smiless.items():
        if len(smiless) < min_membership_size:
            #print "Remove path", path
            for smiles in smiless:
                cluster.smiles_to_paths[smiles].remove(path)
            del cluster.path_to_smiless[path]
    for smiles, paths in cluster.smiles_to_paths.items():
        # This molecule only has very rare paths?
        # What is it, [Dy] or something exotic like that?
        if len(paths) == 0:
            del cluster.smiles_to_paths[smiles]

            
def main():
    #filename = "mols.1000.txt"
    filename = "/Users/dalke/databases/nci_09425001_09450000.smi"
    molecule_limit = None
    max_cluster_size = 4
    num_cutoffs = 256

    mol_input = read_unique_mols(open(filename, "U"), limit=molecule_limit)
    initial_cluster = make_initial_cluster(mol_input)
    remove_rare_paths(initial_cluster, max_cluster_size)
    clusters = [initial_cluster]
    for i, (n, path, dist) in enumerate(partition(clusters, num_cutoffs, max_cluster_size)):
        print "%03s: largest=%s, split using %s (dist=%s)" % (i, n, path, dist)
        
## TEST FUNCTIONS

def test_symbols():
    assert atom_symbols[(34, 0)] == "[Se]"
    assert atom_symbols[(34, 1)] == "[se]"
    assert atom_symbols[(6, 0)] == "C"
    assert atom_symbols[(6, 1)] == "c"
    assert atom_symbols[(5, 0)] == "B"
    assert (5, 1) not in atom_symbols

def test_generate_unique_linear_paths():
    mol = OEGraphMol()
    OEParseSmiles(mol, "C=NOSPBFBr")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = ("C C=N C=N-O C=N-O-S C=N-O-S-P B-P-S-O-N=C C=N-O-S-P-B-F " +
                      "N N-O N-O-S N-O-S-P B-P-S-O-N F-B-P-S-O-N Br-F-B-P-S-O-N " +
                      "O O-S O-S-P B-P-S-O F-B-P-S-O Br-F-B-P-S-O " +
                      "S P-S B-P-S F-B-P-S Br-F-B-P-S " +
                      "P B-P F-B-P Br-F-B-P " +
                      "B B-F B-F-Br " +
                      "F Br-F " +
                      "Br").split()
    for path in paths:
        assert path in expected_paths, path
    assert len(paths) == len(expected_paths)

    mol.Clear()
    OEParseSmiles(mol, "c1ccccccc1")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = "c c~c c~c~c c~c~c~c c~c~c~c~c c~c~c~c~c~c c~c~c~c~c~c~c".split()
    for path in paths:
        assert path in expected_paths
    assert len(paths) == len(expected_paths)

    mol.Clear()
    OEParseSmiles(mol, "c1ccncc1")
    paths = set(generate_unique_linear_smarts(mol))
    expected_paths = ("c c~c c~c~c c~c~c~c c~c~c~c~c " +
                      "n c~n c~c~n c~c~c~n c~c~c~c~n c~c~c~c~c~n " +
                      "      c~n~c c~c~n~c c~c~c~n~c c~c~c~c~n~c " +
                      "                    c~c~n~c~c c~c~c~n~c~c").split()
    for path in paths:
        assert path in expected_paths, path
    for path in expected_paths:
        assert path in paths, path
    assert len(paths) == len(expected_paths)

    # Check that I don't loop
    for smiles, max_bond_count in ( ("C1CC1", 2),
                                    ("C1CCC1", 3),
                                    ("C1CCCC1", 4),
                                    ("C1CCCCC1", 5),
                                    ("C1CCCCCC1", 6),
                                    
                                    ("CC1CC1", 3),
                                    ("CC1CCC1", 4),
                                    ("CC1CCCC1", 5),
                                    ("CC1CCCCC1", 6),
                                    ("CCC1CC1", 4),
                                    ("CCC1CCC1", 5),
                                    ("CCC1CCCC1", 6),
                                    ("CCCC1CC1", 5),
                                    ("CCCC1CCC1", 6),
                                    ("CCCCC1CC1", 6) ):
        mol.Clear()
        assert OEParseSmiles(mol, smiles)
        for path in generate_unique_linear_smarts(mol):
            assert path.count("-") <= max_bond_count

    # Special symbols
    mol.Clear()
    OEParseSmiles(mol, "[C]#[238U]")
    assert set(generate_unique_linear_smarts(mol)) == set("C [U] C#[U]".split())
    mol.Clear()
    OEParseSmiles(mol, "[Cl]$[Pt]")
    assert set(generate_unique_linear_smarts(mol)) == set("Cl [Pt] Cl$[Pt]".split())

    # Hydrogens
    mol.Clear()
    OEParseSmiles(mol, "C([2H])([2H])([2H])[H]")
    assert mol.NumAtoms() == 5
    assert set(generate_unique_linear_smarts(mol)) == set(["C"])
    mol.Clear()
    OEParseSmiles(mol, "[H]")
    assert set(generate_unique_linear_smarts(mol)) == set(["[H]"])
    mol.Clear()
    OEParseSmiles(mol, "C[H]C")
    assert set(generate_unique_linear_smarts(mol)) == set("C [H] C-[H] C-[H]-C".split())
    mol.Clear()
    OEParseSmiles(mol, "[H][H]")
    assert set(generate_unique_linear_smarts(mol)) == set("[H] [H]-[H]".split())

def test():
    test_symbols()
    test_generate_unique_linear_paths()

if __name__ == "__main__":
    test()
    main()



                                Andrew
                                da...@dalkescientific.com

