On Fri, Feb 13, 2009 at 5:16 AM, Andrew Dalke <da...@dalkescientific.com> wrote:
>
> Substructure keys are based on human ideas of what's relevant.
> I implemented a subset of the CACTVS keys and noticed that
> they are, well, rather sparse. Looking now at an PubChem data
> set I see about 70% of the bits are 0. There's also very
> good correlation between some of the bits (if there are 4
> carbons then there are always 2 carbons).
>
> Also, very few of the compounds have, say, Dysprosium
> (CACTVS bit number 108) so that's information perhaps
> best stored in an inverted index. I started something
> along this line a few years ago using Judy trees ...
> Hmm... should think more about this ...

The CACTVS thing seems like a wasted bit; for searching organic
molecules, one can just include a "other atom type" bit to cover these
rare cases. I'd guess it would be hit rarely enough in searches that
collisions caused by redundancy wouldn't matter.

I don't know Judy trees. Do you have a reference/pointer?
They aren't by any chance connected to the thing presented in Andrew
Smellie's recent paper (haven't read it yet)?
http://pubs.acs.org/doi/abs/10.1021/ci800325v

>
>
> I took an idea from clustering. What about generating
> paths for a set of compounds to see which are the most
> divisive for a data set, and use those as the bits for
> doing the screening? This is a hybrid between hashed paths
> and substructure keys.
>
>
> Of course, hash fingerprints are better than structure keys
> for doing similarity comparisons, but similarity and
> substructure filtering are two different things.

So far I agree with most everything. Hashes are good for similarity,
correctly chosen structural keys are good for substructure searching.

Finding some way to automatically generate substructure keys for
effective searching would be effective.

> In my test, I wanted to see how well linear fingerprints
> work for this. While they don't capture topological
> information, they are fast to generate and easy to
> understand. My first step was to get an idea of the
> uniqueness of linear fragments.

I think it's worth looking into branched paths as well for real
substructure searches. People don't query with linear fragments all
that often, so it seems like it would be a win.

> I used the "first_5k.smi" data set from RDKit and found all
> linear paths up to length 7. For each fragment I kept track
> of the molecules which contained that path. The most frequent
> paths are listed below. I eliminated duplicates due to
> the reversible nature of a linear path by choosing the
> "smaller" of the two.
  [snip]
> BTW, I think c~c makes a worse screen than, say, C=O because the
> probability of screening goes something like:
>
> P(1 in the query) * P(0 in the target)
>
> and for c~c that's roughly
>
>    0.87 * (1-0.87) = 10%
>
> while for C=O that's roughly 25%. I think it's best to
> use paths which are in about 50% of the structures.
>

I keep wanting to argue with this, but as long as the bits are
independent, it's clearly correct. If the bits aren't independent,
then it breaks down, but I can't come up with a clear model of how.
(or how to work around that)

[snip]

> To figure out if this is useful requires testing, and that
> requires a data set for doing benchmarks.

Yeah, and if nothing else we will provide a service to the community
by creating such a dataset (or datasets) and a discussion around its
composition.

-greg

Reply via email to