On Feb 13, 2009, at 6:41 AM, Greg Landrum wrote:
> I'm leaving for vacation this morning and have limited time, so I'm
> going to just attach my test data. The rest I'm really looking forward
> to spending some time with later. I will have email access while gone,
> but I won't be as responsive as normal.

No worries.

I have preliminary conclusions for my approach. Summary: it looks
like a small fingerprint (64 bits) can filter 80% of the data set,
but linear paths aren't going to get better than that.

I used my top 271 bits to encode fingerprints for the queries and
targets and found that on average my query fingerprints only
screened the data set from 1000 compounds down to 222. That's 78%
filtered compared to what Greg found:

Experiment 2: reduce fps to 512 bits
RDK fingerprints: filter out 441529 (54%)
Layered: filter out 710647 (86%)
10-15% faster


I'm regenerating my data set to have 512 bits so I can do a
more direct comparison.  ... Done. Interesting. It doesn't
change the result. I still filter out only 78% of the structures.
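
For reference, the filter-rate bookkeeping is nothing more than a
subset test per query/target pair. A sketch (illustrative names,
not the actual pastebin code; fingerprints are Python ints used as
bitsets, one bit per pattern):

def passes_screen(query_fp, target_fp):
    # A target can only contain the query substructure if every bit
    # set in the query fingerprint is also set in the target's.
    return (query_fp & target_fp) == query_fp

def average_unfiltered(query_fps, target_fps):
    # For each query, count how many targets survive the screen,
    # then average over all the queries.
    counts = [sum(1 for tfp in target_fps if passes_screen(qfp, tfp))
              for qfp in query_fps]
    return sum(counts) / float(len(counts))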

Now I'll try training based on the query structures. That's
cheating, but I'm curious ... Also interesting. This is
a 186-bit fingerprint, and the average number of compounds
which are not filtered is ... 221! (78% filtered)

To that I added a few hand-coded patterns:
  C=C, C#C, *1**1, *1***1, *1****1, *1*****1, *1******1
and got 197 (80% filtered).
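
Those are easy to bolt on as extra bits. A sketch using RDKit's
SMARTS matcher (my names here, not the pastebin code):

from rdkit import Chem

# The hand-coded patterns, treated as SMARTS: double and triple
# bonds plus 3- through 7-membered rings of any atoms.
EXTRA_SMARTS = ["C=C", "C#C", "*1**1", "*1***1", "*1****1",
                "*1*****1", "*1******1"]
EXTRA_PATTERNS = [Chem.MolFromSmarts(s) for s in EXTRA_SMARTS]

def extra_bits(mol):
    # One extra fingerprint bit per pattern: set iff it matches.
    return [mol.HasSubstructMatch(patt) for patt in EXTRA_PATTERNS]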

I tried a smaller, 56-bit fingerprint, which on average
leaves 258 compounds unfiltered (74% filtered). Add in that
handful of extra patterns and it's 222 compounds (78% filtered).

Code is at http://pastebin.com/m45f6ebd9


The numbers I get are a pretty solid 80% filter rate.
I can get that even with a 64-bit fingerprint, which is
interesting because that can be stored in a single long int.
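
That makes the whole per-target screen one AND and one compare,
and packing is trivial (sketch):

def pack64(bits):
    # Pack up to 64 pattern bits into a single integer, so the
    # fingerprint fits in one machine word.
    fp = 0
    for i, bit in enumerate(bits):
        if bit:
            fp |= 1 << i
    return fp

# The screen itself is then:  (query_fp & target_fp) == query_fp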

I can do a bit better during path selection by breaking
ties (between paths which are closest to being in 1/2 of the
structures in a cluster) in favor of the shorter path.
I also tried using 1/3 instead of 1/2, but I don't think
there's a big difference. I wasn't rigorous about it, though.
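
The selection rule itself is simple. Roughly (a sketch;
path_counts maps each candidate path to the number of cluster
structures containing it):

def pick_path(path_counts, cluster_size):
    # Prefer the path whose frequency is closest to 1/2 of the
    # cluster (the most informative bit), breaking ties in favor
    # of the shorter path.
    def badness(item):
        path, count = item
        return (abs(count / float(cluster_size) - 0.5), len(path))
    return min(path_counts.items(), key=badness)[0]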




The worst queries, btw, are:

  O = 989 unfiltered
  CC#C = 988 unfiltered (I have no fingerprint with a "#")
  c1ccoc1 = 986 unfiltered (I don't handle cycles). Note: two occurrences:
      [H]c1ccoc1 on line 46
      [H]c1occc1 on line 299
  c1ccccc1 = 975 unfiltered (again, no cycles). Note: two occurrences:
      [H]c1cccc([H])c1 on line 53
      [H]c1ccccc1 on line 708

Hmm. There are many duplicates in the queries:

(columns are "number of compounds not filtered", "canonicalized query pattern")

915 c1ccncc1
915 c1ccncc1
915 c1ccncc1
915 c1ccncc1
915 c1ccncc1
918 Cc1c[nH]cn1
918 Cc1c[nH]nn1
918 Cc1c[nH]nn1
925 CC=NN
927 c1c2c([nH]cn2)ncn1
927 c1c2c(nc[nH]2)ncn1
927 c1cncnc1
927 c1cncnc1
927 c1cncnc1
928 N
928 NN
928 NN
928 c1[nH]nnn1
928 c1cnc[nH]1
928 c1cnccn1
928 c1cnn[nH]1
928 c1ncncn1
975 c1ccccc1
975 c1ccccc1
986 c1ccoc1
986 c1ccoc1
988 CC#C
989 O

To continue down this path I would need to rerun things so that
I generate fragments in query space but filter them in target
space, using the greedy algorithm to pick the fragments which
most increase the ability to filter. That'll have to be some
other day.
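
In outline it would be something like this (only a sketch;
fragments_of() and matches() stand in for the real linear-path
enumeration and substructure test, and the queries/targets could
be SMILES strings):

def greedy_pick(queries, targets, fragments_of, matches, n_bits):
    # Generate candidate fragments from the queries, then repeatedly
    # keep the fragment that filters out the most remaining targets.
    candidates = set()
    for q in queries:
        candidates.update(fragments_of(q))

    # Per query: which targets has the screen not yet removed?
    survivors = dict((q, set(targets)) for q in queries)
    chosen = []

    def gain(frag):
        # A target is screened out only when the query contains the
        # fragment but the target does not.
        total = 0
        for q, surv in survivors.items():
            if matches(frag, q):
                total += sum(1 for t in surv if not matches(frag, t))
        return total

    for _ in range(n_bits):
        if not candidates:
            break
        best = max(candidates, key=gain)
        candidates.discard(best)
        chosen.append(best)
        for q, surv in survivors.items():
            if matches(best, q):
                surv.difference_update(
                    [t for t in surv if not matches(best, t)])
    return chosen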


                                Andrew
                                da...@dalkescientific.com


