On Feb 13, 2009, at 12:57 PM, Andrew Dalke wrote:
I tried a smaller, 56 bit fingerprint, which on average
does not filter 258 compounds (74% filtered). Add in that
handful of extra patterns and it's 222 compounds (78% filtered).

I fixed a few bugs in my code. The biggest was that I got
the wrong atom types for atoms in positions 5, 6, and 7.
I didn't catch that until there was an S in one of those
positions.

The other change is that I now support aromatic atoms, so I
can have [c]-[c] instead of [#6]-[#6].

With those in place, using only linear fragments selected
by my greedy algorithm for a 64 bit fingerprint, I get
the following results:



Using NCI structures 9425001 to 9426001 as the training set
and "queries.txt" (derived from NCI structures 1-1001) as
the target:

     average number not filtered 163.042527339  (84% filtering)


Using these 64 bits and adding a few more SMARTS/bits for ring detection,

     average number not filtered 149.111786148  (85% filtering)

Using queries.txt as both the training set and the target,
     average number not filtered 137.420413123  (86% filtering)
(This is called "cheating")
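
For reference, here is a minimal sketch of how an "average number not filtered" figure like the ones above could be computed. It assumes fingerprints are stored as Python integers and that a target survives the screen when every bit set in the query is also set in the target. The function names and the data layout are illustrative assumptions, not the code used for these results.

```python
def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target.

    Targets that fail this test are filtered out and never reach the
    (slow) substructure matcher.
    """
    return (query_fp & target_fp) == query_fp


def average_not_filtered(query_fps, target_fps):
    """Average, over all queries, of the number of targets that pass the screen."""
    total = 0
    for query_fp in query_fps:
        total += sum(1 for target_fp in target_fps
                     if passes_screen(query_fp, target_fp))
    return total / len(query_fps)


def percent_filtered(avg_not_filtered, num_targets):
    """Convert an average survivor count into a filtering percentage."""
    return 100.0 * (1.0 - avg_not_filtered / num_targets)


# Toy data (assumption): 4-bit fingerprints, two queries, four targets.
queries = [0b0011, 0b0100]
targets = [0b0011, 0b0111, 0b1000, 0b0101]
avg = average_not_filtered(queries, targets)
pct = percent_filtered(avg, len(targets))
```

If the target set holds roughly 1000 structures (an assumption based on the ranges quoted above), percent_filtered(163.042527339, 1000) comes out at about 83.7, which matches the rounded 84% figure.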



These numbers are much better than the 80% I reported before, and they are getting competitive with the 91% Greg reported for the more complex subgraph code in RDKit using 1024 bits, especially given that my fingerprint has only about 64 bits and can be generated very quickly.

I need to take a break from this now, let it percolate, and clean up the code so others can make sense of it.

BTW, if anyone is interested, the 64 patterns are:

[N]-[C]-[C]-[O]
[C]-[C]-[C]-[N]
[C]-[O]
[n]
[c]~[c]~[c]~[c]-[C]-[N]-[N]
[c]-[C]=[O]
[C]-[S]
[N]-[c]~[c]-[O]
[C]-[c]~[c]~[c]-[C]
[c]-[C]-[C]-[N]
[C]-[c]~[c]~[c]~[c]-[O]
[C]-[C]-[C]-[C]-[C]=[O]
[O]=[S]
[#17]
[C]-[C]-[C]-[O]-[c]~[c]-[O]
[#9]
[s]~[c]-[C]-[C]-[C]
[C]-[c]~[c]-[O]
[#35]
[c]-[N]
[O]=[C]-[N]-[C]=[O]
[c]=[O]
[n]-[C]-[c]
[O]-[C]-[C]-[O]
[c]-[C]-[C]
[o]
[C]-[C]-[N]-[C]-[C]-[N]
[C]-[C]-[C]-[C]-[C]-[C]
[#17]-[c]~[c]-[#17]
[s]
[C]=[C]-[C]-[O]
[c]-[C]
[N]=[O]
[S]
[C]-[c]~[c]~[c]~[c]-[C]
[C]-[c]~[c]~[c]~[c]-[S]
[c]-[c]
[C]-[c]~[c]~[c]-[O]-[C]-[C]
[C]-[C]-[O]
[N]-[C]-[C]-[C]-[N]
[C]-[N]-[C]
[c]~[n]~[c]
[N]-[C]-[N]
[C]-[c]~[c]~[c]-[#17]
[C]=[C]
[C]-[C]-[N]-[C]-[C]
[C]-[C]-[C]-[C]-[C]
[C]-[c]~[c]-[C]
[c]-[C]-[O]-[c]~[c]-[O]
[O]-[c]~[c]-[#9]
[C]-[C]-[C]-[C]
[O]-[c]~[c]-[O]
[C]-[C]-[C]
[C]-[c]~[c]~[c]-[N]
[C]-[c]~[c]-[#35]
[C]-[c]~[c]-[#9]
[C]-[c]~[c]~[c]~[c]-[#17]
[#9]-[c]~[c]-[S]
[#53]
[C]-[C]-[N]-[S]
[c]-[C]-[C]-[C]
[C]-[C]-[C]-[S]
[c]-[c]~[c]~[c]-[C]
[n]-[c]
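
A list like this can be turned into a fingerprint by giving each pattern one bit, in list order. The sketch below shows that mapping only. The has_match callback is a hypothetical stand-in for a real SMARTS matcher from whatever toolkit is in use, and the toy "molecule" is just the set of patterns it is known to contain.

```python
def make_fingerprint(molecule, patterns, has_match):
    """Set bit i when patterns[i] matches the molecule.

    has_match(pattern, molecule) is supplied by the caller (assumption:
    in practice it would wrap a toolkit's SMARTS matching call).
    """
    fingerprint = 0
    for bit, pattern in enumerate(patterns):
        if has_match(pattern, molecule):
            fingerprint |= 1 << bit
    return fingerprint


# Toy stand-in matcher (assumption): a "molecule" is the set of patterns
# it contains, so matching is a set lookup.
def toy_has_match(pattern, molecule):
    return pattern in molecule


patterns = ["[n]", "[C]-[O]", "[#17]"]   # first few entries, for illustration
toy_molecule = {"[C]-[O]", "[#17]"}
fp = make_fingerprint(toy_molecule, patterns, toy_has_match)
```

With 64 patterns the result fits in a single 64-bit word, so the screen itself is one AND and one comparison per target.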


                                Andrew
                                da...@dalkescientific.com


