Hi all,

Does anyone here have experience using different values for the numBitsPerFeature parameter of the RDKit fingerprint generator, or can anyone point me to a publication exploring that parameter? I suspect it's not that useful, and that the default should be 1 instead of 2.
Quoting the documentation, numBitsPerFeature sets "the number of bits set per path/subgraph found". As I understand the history, this parameter derives from the Daylight documentation, at https://www.daylight.com/dayhtml/doc/theory/theory.finger.html , which says: "Instead, each pattern serves as a seed to a pseudo-random number generator (it is "hashed"), the output of which is a set of bits (typically 4 or 5 bits per pattern);"

I've been working on a related topic - count emulation using binary fingerprints. For each count C and fingerprint size N I select a random number in the range 0..N-1 (ie, randrange(N)) and set the corresponding bit to 1, repeated C times. I thought the numBitsPerFeature equivalent would be useful, that is, repeating the sampling numBitsPerFeature*C times. I expected this to be more likely to identify near neighbors, as it would increase the number of shared bits between two similar fingerprints.

I tested my method against the exact solution and found that numBitsPerFeature was not useful. That is, numBitsPerFeature=1 for a given N was essentially always better than numBitsPerFeature=2 for the same number of bits N. I did find that numBitsPerFeature=2 for 2*N bits was slightly better than numBitsPerFeature=1 for N bits, but numBitsPerFeature=1 for 2*N bits was still better than numBitsPerFeature=2 for 2*N bits. (See my preprint at https://chemrxiv.org/doi/full/10.26434/chemrxiv-2026-j3hbj )

I tried to figure this out mathematically. My simple attempt says that numBitsPerFeature shouldn't affect things at all. In short, if the original fingerprints have A and B features, with C features in common, and A and B are much less than N, then the number of bits set by a fingerprint with k features is approximately f(k) = N(1-exp(-k/N)), i.e. a = N(1-exp(-A/N)) and b = N(1-exp(-B/N)). This formula is related to the Birthday Problem.
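As a quick numerical check of the approximation f(k) = N(1-exp(-k/N)), here is a minimal Python sketch. It is not the code from the preprint. The function names are made up for illustration, and it assumes every draw is an independent, uniform randrange(N) from Python's random module rather than a hash seeded by the feature. Setting more bits per feature is modeled simply as making bits_per_feature*k draws.

```python
import math
import random


def expected_bits(k, n):
    """Approximate number of distinct bits set after k uniform draws from n positions."""
    return n * (1.0 - math.exp(-k / n))


def simulated_bits(k, n, rng):
    """Draw k bit positions with replacement and count the distinct positions."""
    return len({rng.randrange(n) for _ in range(k)})


def mean_simulated_bits(k, n, trials=200, seed=0):
    """Average the simulated bit count over several independent trials."""
    rng = random.Random(seed)
    return sum(simulated_bits(k, n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 2048
    for k in (50, 200, 800):
        for bits_per_feature in (1, 2):
            draws = bits_per_feature * k  # hypothetical stand-in for numBitsPerFeature * k
            print(
                f"k={k} bits_per_feature={bits_per_feature} "
                f"formula={expected_bits(draws, n):.1f} "
                f"simulated={mean_simulated_bits(draws, n):.1f}"
            )
```

This only checks the count of bits set by a single fingerprint. It says nothing about the overlap between two fingerprints.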
If we assume C maps the same way, so that c = N(1-exp(-C/N)), then the Tanimoto is

    T(fp_A, fp_B) = c / (a + b - c)

    T(fp_A, fp_B) = N(1-exp(-C/N)) / (N(1-exp(-A/N)) + N(1-exp(-B/N)) - N(1-exp(-C/N)))

If the number of bits per feature is doubled, and the number of bits is also doubled, then the Tanimoto score is unchanged because the ratio (2*k)/(2*N) equals k/N.

However, I don't think c (which is the number of bits in common as measured in the final fingerprints) is correctly computed as f(C), because of the higher chance of coincidental overlap with portions of (A-C) and (B-C). This analysis, alas, is beyond my mathematical abilities.

Still, my simulations suggest that setting more than one bit per feature isn't that useful. I suspect the same conclusion would hold with the RDKit fingerprint generator, that is, I suspect numBitsPerFeature=1 would give slightly more accurate matches than numBitsPerFeature=2. Furthermore, I expect it would improve the accuracy at the current default of 2048 bits, and that a 1024-bit fingerprint with numBitsPerFeature=1 would be almost as good as the current 2048-bit default.

I'll add that the RDKit fingerprint generator is used for similarity, while the Daylight fingerprints were also used as substructure search screens. In the latter, the number of bits affects screenout, for information content reasons. I've been told the Daylight fingerprints set a different number of bits depending on the fingerprint length.

Best regards,

Andrew
[email protected]

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

