On Feb 13, 2009, at 12:57 PM, Andrew Dalke wrote:
I tried a smaller, 56 bit fingerprint, which on average
does not filter 258 compounds (74% filtered). Add in that
handful of extra patterns and it's 222 compounds (78% filtered).
I fixed a few bugs in my code. The biggest was that I got
the wrong atom types for atoms in positions 5, 6, and 7.
I didn't catch that until there was an S in one of those
positions.
The other change is that I now support aromatic atoms, so I
can have [c]-[c] instead of [#6]-[#6].
With those in place, using only linear fragments selected
by my greedy algorithm for a 64 bit fingerprint, I get
the following results:
Using NCI structures 9425001 to 9426001 as the training set
and "queries.txt" (derived from NCI structures 1-1001) as
the target:
average number not filtered 163.042527339 (84% filtering)
Using these 64 bits and adding a few more SMARTS/bits for ring
detection:
average number not filtered 149.111786148 (85% filtering)
Using queries.txt as both the training set and the target:
average number not filtered 137.420413123 (86% filtering)
(This is called "cheating")
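For reference, the percentages follow from the averages by simple arithmetic. A minimal sketch, assuming the target set holds 1000 structures (the exact count is not stated here, only that queries.txt is derived from NCI structures 1-1001); percent_filtered is a name made up for this illustration:

```python
def percent_filtered(avg_unfiltered, n_targets=1000):
    """Percentage of targets the screen rejects, on average.

    n_targets=1000 is an assumption; the message does not give the exact count.
    """
    return 100.0 * (1.0 - avg_unfiltered / n_targets)


# The three averages reported above.
for avg in (163.042527339, 149.111786148, 137.420413123):
    print(avg, round(percent_filtered(avg)))
```

With 1000 targets the three averages round to 84, 85 and 86 percent.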
These numbers are much better than the 80% I reported before, and they
are getting competitive with the 91% Greg reported for the more
complex subgraph code in RDKit using 1024 bits, given that my
fingerprint has only about 64 bits and can be generated very quickly.
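To show why a fingerprint this small is cheap to use, here is a rough sketch of the usual substructure screen with each fingerprint held in one integer. It is not the code behind the numbers above; the function names and the bit positions in the example are made up for illustration.

```python
def make_fingerprint(bit_positions):
    """Pack an iterable of bit positions (0..63) into a single integer."""
    fp = 0
    for pos in bit_positions:
        fp |= 1 << pos
    return fp


def may_contain(query_fp, target_fp):
    """True if the target cannot be ruled out.

    A target survives the screen only when every bit set in the query
    is also set in the target.
    """
    return (query_fp & target_fp) == query_fp


def count_unfiltered(query_fp, target_fps):
    """Number of targets that would still need a full substructure match."""
    return sum(1 for target_fp in target_fps if may_contain(query_fp, target_fp))


query = make_fingerprint([0, 3])
targets = [make_fingerprint([0, 3, 5]), make_fingerprint([0]), make_fingerprint([3, 0])]
print(count_unfiltered(query, targets))
```

Averaging count_unfiltered over all queries gives the "average number not filtered" figure.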
I need to take a break from this now, let it percolate, and clean up
the code so others can make sense of it.
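Until the cleaned-up code is available, here is a generic sketch of one way a greedy pattern selection could work. It is an assumption about the general approach, not the actual algorithm, which is not shown in this message. It assumes each structure has already been reduced to the set of candidate patterns it matches, and at each step it takes the pattern that screens out the most (query, target) pairs that are still unfiltered. greedy_select and its arguments are hypothetical names.

```python
def greedy_select(query_feats, target_feats, n_bits):
    """Greedily choose up to n_bits patterns.

    query_feats / target_feats: non-empty lists of sets of patterns, one per structure.
    Returns (chosen patterns, average number of unfiltered targets per query).
    """
    # Every (query, target) pair starts out unfiltered.
    surviving = {(q, t) for q in range(len(query_feats))
                 for t in range(len(target_feats))}
    # Only patterns present in some query can ever filter anything.
    candidates = set().union(*query_feats)
    chosen = []

    def is_eliminated(pattern, q, t):
        # A pattern rules out a target when the query has it and the target lacks it.
        return pattern in query_feats[q] and pattern not in target_feats[t]

    while candidates and len(chosen) < n_bits:
        def gain(pattern):
            return sum(1 for q, t in surviving if is_eliminated(pattern, q, t))

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        if gain(best) == 0:
            break  # no remaining pattern filters anything further
        chosen.append(best)
        candidates.discard(best)
        surviving = {(q, t) for q, t in surviving
                     if not is_eliminated(best, q, t)}
    return chosen, len(surviving) / len(query_feats)


# Toy data, not real chemistry.
queries = [{"[C]-[O]", "[n]"}, {"[C]-[O]"}]
targets = [{"[C]-[O]"}, {"[n]"}, set()]
print(greedy_select(queries, targets, 2))
```

This version rescans every surviving pair for every candidate, so it is only practical for small training sets; a real implementation would want something smarter.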
BTW, if anyone is interested, the 64 patterns are:
[N]-[C]-[C]-[O]
[C]-[C]-[C]-[N]
[C]-[O]
[n]
[c]~[c]~[c]~[c]-[C]-[N]-[N]
[c]-[C]=[O]
[C]-[S]
[N]-[c]~[c]-[O]
[C]-[c]~[c]~[c]-[C]
[c]-[C]-[C]-[N]
[C]-[c]~[c]~[c]~[c]-[O]
[C]-[C]-[C]-[C]-[C]=[O]
[O]=[S]
[#17]
[C]-[C]-[C]-[O]-[c]~[c]-[O]
[#9]
[s]~[c]-[C]-[C]-[C]
[C]-[c]~[c]-[O]
[#35]
[c]-[N]
[O]=[C]-[N]-[C]=[O]
[c]=[O]
[n]-[C]-[c]
[O]-[C]-[C]-[O]
[c]-[C]-[C]
[o]
[C]-[C]-[N]-[C]-[C]-[N]
[C]-[C]-[C]-[C]-[C]-[C]
[#17]-[c]~[c]-[#17]
[s]
[C]=[C]-[C]-[O]
[c]-[C]
[N]=[O]
[S]
[C]-[c]~[c]~[c]~[c]-[C]
[C]-[c]~[c]~[c]~[c]-[S]
[c]-[c]
[C]-[c]~[c]~[c]-[O]-[C]-[C]
[C]-[C]-[O]
[N]-[C]-[C]-[C]-[N]
[C]-[N]-[C]
[c]~[n]~[c]
[N]-[C]-[N]
[C]-[c]~[c]~[c]-[#17]
[C]=[C]
[C]-[C]-[N]-[C]-[C]
[C]-[C]-[C]-[C]-[C]
[C]-[c]~[c]-[C]
[c]-[C]-[O]-[c]~[c]-[O]
[O]-[c]~[c]-[#9]
[C]-[C]-[C]-[C]
[O]-[c]~[c]-[O]
[C]-[C]-[C]
[C]-[c]~[c]~[c]-[N]
[C]-[c]~[c]-[#35]
[C]-[c]~[c]-[#9]
[C]-[c]~[c]~[c]~[c]-[#17]
[#9]-[c]~[c]-[S]
[#53]
[C]-[C]-[N]-[S]
[c]-[C]-[C]-[C]
[C]-[C]-[C]-[S]
[c]-[c]~[c]~[c]-[C]
[n]-[c]
Andrew
da...@dalkescientific.com