[There's a lot here. I was experimenting with an approach but
ended up doubtful, and now in need of a good test set.

Greg? Is your test data set of 1,000 compounds available? And
your fragments? I want to do some experiments of my own.

Anyone else? Know any good representative collection of
structure queries?]


My thought is that hash fingerprints are overkill. They handle
any sort of chemistry, but most chemistry (that we deal with)
is pretty regular. Or boring.

Substructure keys are based on human ideas of what's relevant.
I implemented a subset of the CACTVS keys and noticed that
they are, well, rather sparse. Looking now at a PubChem data
set I see that about 70% of the bits are 0. There's also very
good correlation between some of the bits (if there are 4
carbons then there are always 2 carbons).

Also, very few of the compounds have, say, Dysprosium
(CACTVS bit number 108) so that's information perhaps
best stored in an inverted index. I started something
along this line a few years ago using Judy trees ...
Hmm... should think more about this ...



I took an idea from clustering. What about generating
paths for a set of compounds to see which are the most
divisive for a data set, and using those as the bits for
doing the screening? This is a hybrid between hashed paths
and substructure keys.


Of course, hash fingerprints are better than structure keys
for doing similarity comparisons, but similarity and
substructure filtering are two different things.


In my test, I wanted to see how well linear fingerprints
work for this. While they don't capture topological
information, they are fast to generate and easy to
understand. My first step was to get an idea of the
uniqueness of linear fragments.

I used the "first_5k.smi" data set from RDKit and found all
linear paths up to length 7. For each fragment I kept track
of the molecules which contained that path. The most frequent
paths are listed below. I eliminated duplicates due to
the reversible nature of a linear path by choosing the
"smaller" of the two.

BTW, first_5k.smi has 4900 unique SMILES.

The notation for each path is
  [element number, bond, element number, bond, ... element number]
and the number afterwards is the number of structures containing
that path. I only show paths which are in at least 1,000 structures.


[6, 1, 6, 1, 6, 2, 8] 1022
[6, 1, 6, 1, 6, 1, 6, 1, 8] 1043
[6, 1, 6, 1, 6, '~', 6] 1047
[6, 2, 6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6] 1052
[6, 1, 6, 1, 8, 1, 6] 1081
[6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 6, 1, 7] 1094
[6, 2, 6, 1, 6, '~', 6, '~', 6, '~', 6] 1100
[6, '~', 6, '~', 6, '~', 6, 1, 6, 1, 7] 1113
[6, 2, 6, 1, 6, '~', 6, '~', 6] 1118
[6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6] 1219
[6, 1, 6, 1, 6, 1, 8] 1246
[6, 1, 7, 1, 6] 1247
[6, 1, 8, 1, 6] 1284
[6, '~', 6, '~', 6, '~', 6, 1, 7] 1285
[6, '~', 6, '~', 6, 1, 7] 1285
[8, 1, 6, 2, 8] 1291
[6, '~', 6, 1, 7] 1292
[6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 7] 1299
[6, 1, 6, 1, 7] 1306
[6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 7] 1325
[6, 1, 6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6] 1508
[6, 1, 6, 1, 6, '~', 6, '~', 6, '~', 6] 1583
[6, 1, 6, 1, 6, '~', 6, '~', 6] 1597
[6, 1, 6, 1, 6, 1, 6, 1, 6] 1710
[6, 1, 6, 1, 6, 1, 6] 1878
[6, 1, 6, 1, 8] 2043
[6, 1, 6, 2, 8] 2100
[6, 1, 6, '~', 6, '~', 6] 2225
[6, 1, 6, '~', 6] 2256
[6, 2, 8] 2311
[6, 1, 7] 2341
[6, 1, 8] 2636
[6, 1, 6, 1, 6] 2820
[6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6] 2899
[6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6] 2906
[7] 2942
[6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6] 3014
[6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6] 3064
[6, 1, 6, '~', 6, '~', 6, '~', 6] 3093
[6, '~', 6, '~', 6, '~', 6, '~', 6] 3136
[6, '~', 6, '~', 6, '~', 6] 3136
[6, '~', 6, '~', 6] 3256
[6, '~', 6] 3269
[8] 3884
[6, 1, 6] 4246
[6] 4877

  Code is at http://pastebin.com/m77b09e6f


You can see that some of these are very well correlated, like

[6, '~', 6, '~', 6] 3256
[6, '~', 6] 3269

BTW, I think c~c makes a worse screen than, say, C=O because the
probability that a given bit screens out a target goes something like:

P(bit is 1 in the query) * P(bit is 0 in the target)

and for c~c that's roughly

   0.87 * (1 - 0.87) = 11%

while for C=O that's roughly 25%. Since p * (1 - p) peaks at
p = 0.5, I think it's best to use paths which are in about
50% of the structures.
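That figure of merit is easy to compute directly from the counts in the table above; a sketch, using the 4900 unique-SMILES total mentioned earlier:

```python
# Screening power of a path bit: the probability that it is set in a
# random query AND clear in a random target, i.e. p * (1 - p).
# This peaks at p = 0.5, so ~50%-frequency paths screen best.
N = 4900  # unique SMILES in first_5k.smi

def screen_power(count, n=N):
    p = count / n
    return p * (1 - p)

# Counts from the table above for this data set:
power_cc = screen_power(3269)   # c~c, [6, '~', 6]
power_co = screen_power(2311)   # C=O, [6, 2, 8]
```

On these particular counts, C=O still comes out the stronger screen of the two.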




There are 14,351 unique paths, and the count distribution is:
(The following says that there are 5,973 linear paths which
are found in only 1 of the 4,900 structures.)

#structures
containing       number of paths
a given path     with that count
===========      ===============
 1                5973
 2                1893
 3                 883
 4                 685
 5                 481
 6                 439
 7                 424
 8                 278
 9                 228
 10                231
 11                170
 12                145
 13                115
 14                113
 15                109
  [ rest omitted ]



I wrote a greedy algorithm to find the most divisive paths.

I'll define "cluster" as "set of compounds which filter the same."
This is a tree-like definition. The root is everything. For
each leaf, find the path which is closest to being present in
50% of the structures for the cluster.

Given that path, for *every* leaf cluster make up to two children:
  - one child is the cluster of compounds which are screened by that path
  - the other child contains the compounds which are not screened
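The steps above can be sketched as follows. This is my own minimal reconstruction, not the pastebin code: `memberships` maps each canonical path to the set of compound ids containing it, and each round splits on the path closest to 50% presence in the largest cluster.

```python
# Greedy "most divisive path" clustering: repeatedly pick the largest
# leaf cluster, find the path closest to 50% presence within it, then
# split *every* leaf cluster into screened / not-screened children.
def divisive_splits(memberships, all_ids, n_splits):
    clusters = [set(all_ids)]
    chosen = []
    for _ in range(n_splits):
        clusters.sort(key=len, reverse=True)
        big = clusters[0]
        best = min(memberships,
                   key=lambda p: abs(len(memberships[p] & big) - len(big) / 2))
        chosen.append((len(big), best))
        next_clusters = []
        for c in clusters:
            hit = c & memberships[best]
            miss = c - memberships[best]
            if hit:
                next_clusters.append(hit)
            if miss:
                next_clusters.append(miss)
        clusters = next_clusters
    return chosen

# Toy example: 8 compounds, 3 paths with made-up memberships.
m = {"half": {0, 1, 2, 3}, "quarter": {0, 1}, "all": set(range(8))}
splits = divisive_splits(m, range(8), 2)
```

Each entry of the result pairs the size of the cluster being split with the path chosen for it, matching the "largest=N; split on [...]" lines below.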

The most divisive paths using that algorithm on my test set were:

largest=4900; split on [6, 1, 7]
largest=2559; split on [6, 2, 8]
largest=1299; split on [6, 1, 7, 1, 6]
largest=1290; split on [6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6]
largest=679; split on [6, 1, 8]
largest=548; split on [6, 1, 6, 1, 6, 1, 6, 1, 8, 1, 6]
largest=360; split on [6, 1, 6, 1, 6]
largest=276; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 2, 6]
largest=249; split on [6, 1, 6]
largest=194; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6]
largest=178; split on [6, 1, 8, 1, 6]
largest=132; split on [6, '~', 6, '~', 6, 1, 7, 1, 7]
largest=122; split on [7]
largest=107; split on [6, 1, 6, 1, 6, 1, 8, 1, 8, 2, 6]
largest=101; split on [6, 1, 6, 1, 6, 1, 6, '~', 6]
largest=92; split on [6, 1, 6, 1, 6, 1, 8, 1, 8]
largest=85; split on [6, 2, 6]
largest=82; split on [6, 1, 6, '~', 6, '~', 6]
largest=78; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 8]
largest=68; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, 1, 7]
largest=65; split on [8, 1, 6, 1, 6, 1, 8]
largest=63; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 6]
largest=56; split on [6, 1, 6, 1, 6, 1, 6, 3, 6]
largest=53; split on [6, '~', 7, '~', 6]
largest=52; split on [6, 1, 17]
largest=51; split on [9, 1, 6, 1, 9]
largest=50; split on [6, 1, 35]
largest=49; split on [6, 1, 6, 1, 8]
largest=49; split on [6, 1, 6, 1, 6, 1, 6, 1, 6]
largest=49; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6]
largest=49; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 8]
largest=49; split on [6, 1, 6, 1, 6, 1, 6, 1, 8]
largest=49; split on [8]
largest=49; split on [6]
largest=49; split on [6, 1, 6, 1, 6, 1, 6]
largest=49; split on [6, 1, 6, 1, 6, 1, 8]

  Code is at http://pastebin.com/d7a231217


The clusters beyond this contain 43 compounds or fewer, which is
under 1% of the original data set.

I listed 36 paths but they were of course tuned to that data
set. I tried another data set (NCI compounds 9425001-9450000)
to see if the results were data-set specific. The first 36
divisive paths of that one are:

largest 21011; split on [6, '~', 6, '~', 6, 1, 6, 1, 7]
largest 10566; split on [6, '~', 6, '~', 6, 1, 7, 1, 6, 2, 8]
largest 5708; split on [6, 1, 6, 1, 6, 1, 6]
largest 3333; split on [6, 1, 6, 1, 7, 1, 6, 2, 6]
largest 2236; split on [6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 6, 2, 8]
largest 1661; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 7]
largest 1425; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 7]
largest 1035; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, 1, 7, 1, 6]
largest 997; split on [6, 1, 6, 2, 6, 1, 6, '~', 6, '~', 6, 1, 6]
largest 915; split on [6, 1, 8]
largest 501; split on [6, 1, 6, 1, 6, '~', 6, '~', 6, 1, 6]
largest 422; split on [16]
largest 322; split on [6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 9]
largest 232; split on [6, '~', 6, '~', 6, '~', 16, '~', 6, '~', 7]
largest 219; split on [6, 1, 8, 1, 6, 1, 6, '~', 6, 1, 8, 1, 6]
largest 208; split on [6, 1, 6, 1, 6, 1, 7, 1, 6, 2, 8]
largest 173; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6]
largest 139; split on [6, 1, 6, '~', 6, '~', 6, 1, 6, 1, 7, 1, 6]
largest 134; split on [6, 1, 6, 1, 6, '~', 6]
largest 130; split on [1, 1, 6, 1, 6, 1, 6, 1, 6, 1, 8, 1, 6]
largest 128; split on [6, 1, 6, '~', 6, 1, 6, 2, 6]
largest 106; split on [6, '~', 6, '~', 8, 1, 8, 1, 6, 1, 6, 1, 8]
largest 102; split on [6, '~', 6, '~', 6, '~', 6, '~', 6, '~', 6, 1, 17]
largest 88; split on [1, 1, 6, 1, 7]
largest 72; split on [6, 1, 6, '~', 6, '~', 6, '~', 6, 1, 8, 1, 6]
largest 56; split on [6, 1, 6, 1, 6, 1, 6, 1, 6, 1, 6, 2, 8]
largest 55; split on [6, 1, 6, 1, 16, 1, 6, '~', 6]
largest 47; split on [6, 1, 6, 1, 8]
largest 46; split on [6, 1, 6, 1, 6, 1, 7, 1, 7, 1, 6, 2, 6]
largest 44; split on [6, 1, 6, 1, 6, 1, 8]
largest 44; split on [6, 1, 6, 1, 7, 1, 7, 1, 6, 1, 6, 2, 8]
largest 44; split on [6, 1, 6, '~', 6, 1, 7, 1, 7, 1, 6]
largest 38; split on [6, 1, 7, 1, 6, 1, 6, 1, 7, 1, 6, 2, 8]
largest 36; split on [6, 1, 6, 1, 6, '~', 6, '~', 6, '~', 6, 1, 7]
largest 35; split on [6, 1, 6, 1, 8, 1, 6, 1, 6, 1, 6, 1, 8]
largest 34; split on [6, 2, 8, '~', 6, '~', 6, '~', 6, '~', 6, 1, 8]

Of these, only 4 paths are in common, which leads me to think
this approach is not useful. On the other hand, there are a lot
of ties, and one data set might split on, say,
   [6, '~', 6, '~', 6, '~', 6, 1, 8]
while the other splits on
   [6, '~', 6, '~', 6, 1, 8]
in which case both may be almost equally divisive, even without
being identical paths.


To figure out if this is useful requires testing, and that
requires a data set for doing benchmarks.


                                Andrew
                                da...@dalkescientific.com


