Is anyone here interested in evaluating my new method to emulate count fingerprints using binary fingerprints?
I've added that feature to chemfp 5.0b2, released yesterday, but I don't have the expertise to evaluate its effectiveness.

In short, for most Linux-based OSes, install chemfp, generate count fingerprints, and convert the count fingerprints to binary fingerprints using the following steps:

    python -m pip install chemfp==5.0b2 -i https://chemfp.com/packages/
    chemfp rdkit2fpc dataset.sdf.gz -o dataset.fpc
    chemfp fpc2fps dataset.fpc -o dataset.fps

then use chemfp's "simsearch" for similarity search of the FPS (or FPB) files, like:

    simsearch --query 'c1ccccc1O' -k 5 --out csv dataset.fps

The "--help" output for these commands is documented at https://chemfp.com/docs/tool_help.html .

The "FPC" format is my new text-based exchange format for count fingerprints, described at https://chemfp.com/fpc_format/ .

Here's some background.

RDKit supports several count fingerprints (Morgan, RDKit fingerprints, Atom Pair, and Torsion). These can be viewed as a list of (feature id, count) pairs.

By default RDKit converts these into binary fingerprints by folding the feature id, that is, setting the binary fingerprint bit i to 1, where i = (feature id) modulo fpSize. This method ignores the counts.

These fingerprint generators also implement a countSimulation method, which sets additional bits based on count thresholds. For example, if the countBounds is 1,3,9 then it sets one bit if the count is at least 1, two bits if the count is at least 3, and three bits if the count is at least 9. (The actual algorithm is a bit more complicated than this.)

I've come up with a new method which is a cross between Calvin Mooers' superimposed coding and the Daylight RNG approach.

It's based on the observation that Morgan fingerprints are typically quite sparse, e.g., for Morgan3 count fingerprints from ChEMBL 33 the average fingerprint has 71 distinct features, with an average feature count of 1.5.
That means there are on average about 107 (71 * 1.5) possible bits to set in the output binary fingerprint, assuming each count sets 1 bit, e.g., that feature 2246728737 with count 2 can set 2 bits.

But how to choose those bits?

My new method uses the feature id to seed an RNG, which is then used to get `count` output bit positions, randomly chosen from the output fingerprint size.

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

There are a few tunable parameters: 1) the output fingerprint size, 2) the number of bits to set for each count, and 3) an upper bound for the feature count, so the full algorithm is a bit more complicated:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(min(count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

The reason for "bits_per_count" is to reduce the effect of collisions. Doubling the fingerprint size and doubling the number of bits per count keeps the output density roughly unchanged, but should reduce the collision rate between two pairs of (feature id, specific count).

That's my hand-waving belief, but I don't have the specific experience in evaluating fingerprint effectiveness. I know other RDKit users do, and might be able to help.

What I know so far is that it's a bit better than RDKit's count simulation at predicting MW. https://mstdn.science/@molecule/115063149386391787 :)

The "fpc2fps" command supports other methods, like "scaled", which is a cross between superimposed and the RDKit count simulation. Rather than use `count` random numbers, it takes a lookup table of count thresholds to get the actual repeat to use.

See the fpc2fps --help-methods for more complete details, or contact me.

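
For anyone who wants to experiment without installing anything, here is a minimal runnable sketch of the superimposed pseudocode above, next to plain modulo folding for comparison. It is only an illustration under stated assumptions: Python's random.Random stands in for the RNG, sets of bit positions stand in for the binary fingerprint, and the function names, the default values, and the second feature id (12345) are hypothetical. It will not reproduce the bits that chemfp's fpc2fps actually sets.

```python
import random


def superimposed_bits(features, num_bits=2048, bits_per_count=1, max_count=255):
    """Return the set of on-bit positions for a list of (feature_id, count) pairs.

    Assumption: random.Random seeded with the feature id plays the role of
    RNG(feature_id) in the pseudocode; the defaults are arbitrary choices.
    """
    on_bits = set()
    for feature_id, count in features:
        rng = random.Random(feature_id)  # same feature id -> same bit sequence
        for _ in range(min(count, max_count) * bits_per_count):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


def folded_bits(features, num_bits=2048):
    """Plain folding: bit i = (feature id) modulo num_bits; counts are ignored."""
    return {feature_id % num_bits for feature_id, _count in features}


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two toy fingerprints which differ only in the count of one feature.
# 2246728737 is the feature id mentioned above; 12345 is made up.
features_a = [(2246728737, 2), (12345, 1)]
features_b = [(2246728737, 1), (12345, 1)]

fold_a, fold_b = folded_bits(features_a), folded_bits(features_b)
sup_a, sup_b = superimposed_bits(features_a), superimposed_bits(features_b)

print("folded Tanimoto:      ", tanimoto(fold_a, fold_b))
print("superimposed Tanimoto:", tanimoto(sup_a, sup_b))
```

Because the RNG is seeded only by the feature id, the bits set for a lower count are always a subset of the bits set for a higher count of the same feature, so a count difference shows up as extra bits rather than unrelated ones. Folding, in contrast, gives identical fingerprints for the two toy inputs.
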
This 5.0b2 release also includes a "simhistogram" method to generate a histogram from all possible Tanimoto scores, a "shardsearch" method to search multiple target files ("shards") and merge the results, and it has a reasonably performant implementation of the 4860-bit Klekota-Roth fingerprint.

See https://chemfp.com/docs/whats_new_in_50.html to learn more.

Best regards,

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss