Is anyone here interested in evaluating my new method to emulate count 
fingerprints using binary fingerprints?

I've added that feature to chemfp 5.0b2, released yesterday, but I don't have 
the expertise to evaluate its effectiveness.

In short, for most Linux-based OSes, install chemfp, generate count 
fingerprints, and convert count fingerprints to binary fingerprints using the 
following steps:

  python -m pip install chemfp==5.0b2 -i https://chemfp.com/packages/
  chemfp rdkit2fpc dataset.sdf.gz -o dataset.fpc
  chemfp fpc2fps dataset.fpc -o dataset.fps

then use chemfp's "simsearch" for similarity search of the FPS (or FPB) files, 
like:

  simsearch --query 'c1ccccc1O' -k 5 --out csv dataset.fps

The "--help" output for each of these commands is documented at 
https://chemfp.com/docs/tool_help.html . The "FPC" format is my new text-based 
exchange format for count fingerprints, described at 
https://chemfp.com/fpc_format/ .

Here's some background.

RDKit supports several count fingerprints (Morgan, RDKit fingerprints, Atom 
Pair, and Torsion). These can be viewed as a list of (feature id, count) pairs.

By default RDKit converts these into binary fingerprints by folding the feature 
id, that is, setting the binary fingerprint bit i to 1, where i = (feature id) 
modulo fpSize. This method ignores the counts.
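
As a minimal sketch of the folding step described above, in plain Python rather than RDKit's own code: the function name `fold_to_bits` and the dict-of-counts input are assumptions made for illustration, and the feature ids other than 2246728737 are made up.

```python
def fold_to_bits(features, fp_size=2048):
    """Return the set of on-bit positions for a {feature_id: count} mapping.

    Each feature sets bit (feature_id mod fp_size); the count is ignored.
    """
    return {feature_id % fp_size for feature_id in features}


# Hypothetical sparse count fingerprint: feature id -> count.
features = {2246728737: 2, 5: 3, 2053: 1}
on_bits = fold_to_bits(features)

# 5 and 2053 differ by 2048, so they fold onto the same bit, and the
# counts (3 and 1) leave no trace in the result.
assert 5 in on_bits
assert len(on_bits) == 2
```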

These fingerprint generators also implement a countSimulation option, which 
sets additional bits based on count thresholds. For example, if countBounds is 
1,3,9 then it sets one bit if the count is at least 1, two bits if the count 
is at least 3, and three bits if the count is at least 9. (The actual algorithm 
is a bit more complicated than this.)
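
The threshold idea can be sketched roughly as follows. This is a simplified illustration, not RDKit's implementation: the function name `simulate_counts` and the bit layout (one group of adjacent bits per folded feature, one bit per threshold) are assumptions.

```python
def simulate_counts(features, fp_size=2048, count_bounds=(1, 3, 9)):
    """Return on-bit positions for a {feature_id: count} mapping.

    Assumed layout: the fingerprint is divided into groups of
    len(count_bounds) adjacent bits. A feature folds onto one group, and
    bit i of that group is set when count >= count_bounds[i].
    """
    width = len(count_bounds)
    num_groups = fp_size // width
    on_bits = set()
    for feature_id, count in features.items():
        base = (feature_id % num_groups) * width
        for i, bound in enumerate(count_bounds):
            if count >= bound:
                on_bits.add(base + i)
    return on_bits


# With bounds 1,3,9: count 1 sets one bit, count 4 sets two, count 9 sets three.
assert len(simulate_counts({1: 1})) == 1
assert len(simulate_counts({1: 4})) == 2
assert len(simulate_counts({1: 9})) == 3
```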

I've come up with a new method which is a cross between Calvin Mooers' 
superimposed coding and the Daylight RNG approach.

It's based on the observation that Morgan fingerprints are typically quite 
sparse. For example, for Morgan3 count fingerprints from ChEMBL 33 the average 
fingerprint has 71 distinct features, with an average feature count of 1.5. 
That means there are on average about 107 (71 * 1.5) possible bits to set in 
the output binary fingerprint, assuming each count sets 1 bit, so that, for 
example, feature 2246728737 with count 2 can set 2 bits.

But how to choose those bits?

My new method uses the feature id to seed an RNG, which is then used to get 
`count` output bit positions, randomly chosen from the output fingerprint size.

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

There are a couple of tunable parameters: 1) the output fingerprint size, 2) 
the number of bits to set for each count, and 3) an upper bound for the feature 
count, so the full algorithm is a bit more complicated:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(min(count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
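
A runnable sketch of the pseudocode above, under stated assumptions: Python's `random.Random` stands in for whatever RNG chemfp actually uses, a set of bit positions stands in for the fingerprint object, and the function name `superimpose` is hypothetical. The bit positions will therefore not match chemfp's output.

```python
import random


def superimpose(features, num_bits=2048, bits_per_count=1, max_count=None):
    """Return on-bit positions for a {feature_id: count} mapping.

    Each feature seeds its own RNG, which then draws
    min(count, max_count) * bits_per_count bit positions.
    """
    on_bits = set()
    for feature_id, count in features.items():
        rng = random.Random(feature_id)  # stand-in RNG, seeded by the feature id
        if max_count is not None:
            count = min(count, max_count)
        for _ in range(count * bits_per_count):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


# The same feature id always yields the same sequence of draws, so the bits
# for count 2 are a subset of the bits for count 3.
low = superimpose({2246728737: 2})
high = superimpose({2246728737: 3})
assert low <= high
```

Because the draws for a given feature id form a fixed sequence, two molecules that share a feature with counts 2 and 3 share the bits from the first two draws, which is how the binary overlap can reflect the count overlap (barring collisions).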

The reason for "bits_per_count" is to reduce the effect of collisions. Doubling 
the fingerprint size and doubling bits_per_count keeps the output density 
roughly unchanged, but should reduce the collision rate between two 
(feature id, specific count) pairs.
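
The density half of that argument can be checked with a short calculation. This sketch assumes the bit positions behave like independent uniform draws, uses the average of about 107 draws mentioned earlier, and picks 2048 bits as an illustrative size (not a value from chemfp). It says nothing about the collision rate itself.

```python
def expected_density(num_draws, num_bits):
    """Expected fraction of on-bits after num_draws independent uniform draws."""
    return 1.0 - (1.0 - 1.0 / num_bits) ** num_draws


# bits_per_count=1 with 2048 bits, versus bits_per_count=2 with 4096 bits.
base = expected_density(107, 2048)
doubled = expected_density(2 * 107, 2 * 2048)

print(f"{base:.4f} {doubled:.4f}")  # the two densities are nearly identical
```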

That's my hand-waving belief, but I don't have the specific experience in 
evaluating fingerprint effectiveness.

I know other RDKit users do, and might be able to help.

What I know so far is that it's a bit better than RDKit's count simulation at 
predicting MW. https://mstdn.science/@molecule/115063149386391787 :)

The "fpc2fps" command supports other methods, like "scaled", which is a cross 
between superimposed coding and the RDKit count simulation. Rather than use 
`count` random numbers, it uses a lookup table of count thresholds to get the 
actual number of repeats to use. See "fpc2fps --help-methods" for more complete 
details, or contact me.

This 5.0b2 release also includes a "simhistogram" method to generate a 
histogram from all possible Tanimoto scores, a "shardsearch" method to search 
multiple target files ("shards") and merge the results, and it has a reasonably 
performant implementation of the 4860-bit Klekota-Roth fingerprint. 

See https://chemfp.com/docs/whats_new_in_50.html to learn more.

Best regards,

                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
