Hi all,

  Does anyone here have experience in using different values for the 
numBitsPerFeature parameter of the RDKit fingerprint generator, or can point me 
to a publication exploring that parameter? I suspect it's not that useful, and 
the default should be 1 instead of 2.

Quoting the documentation, numBitsPerFeature sets "the number of bits set per 
path/subgraph found".

As I understand the history, this parameter derives from the Daylight 
documentation, at 
https://www.daylight.com/dayhtml/doc/theory/theory.finger.html , which says:

"Instead, each pattern serves as a seed to a pseudo-random number generator (it 
is "hashed"), the output of which is a set of bits (typically 4 or 5 bits per 
pattern);"

I've been working on a related topic - count emulation using binary 
fingerprints. For each count C and fingerprint size N I select a random number 
in the range 0..N-1 (ie, randrange(N)) and set the corresponding bit to 1; 
repeated C times.

I thought the numBitsPerFeature equivalent would be useful, that is, repeat the 
sampling numBitsPerFeature*C times. I thought this would be more likely to 
identify near neighbors as it would increase the number of shared bits between 
two similar fingerprints.
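
A minimal sketch of the sampling described above, to make the idea concrete. It
is not the code from the preprint. The names emulate_counts and tanimoto are
hypothetical, and seeding the generator from the feature name (so the same
feature always produces the same bit sequence in every fingerprint) is an
assumption of this sketch.

```python
import random


def emulate_counts(feature_counts, n_bits, bits_per_feature=1):
    """Return the set of on-bit positions for a {feature: count} mapping."""
    on_bits = set()
    for feature, count in feature_counts.items():
        # Assumption: seed from the feature name so a given feature always
        # yields the same sequence of bit positions across fingerprints.
        rng = random.Random(str(feature))
        # Draw bits_per_feature * count positions; collisions are allowed.
        for _ in range(bits_per_feature * count):
            on_bits.add(rng.randrange(n_bits))
    return on_bits


def tanimoto(bits_a, bits_b):
    """Binary Tanimoto similarity between two sets of on-bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0


# Hypothetical feature names and counts, for illustration only.
fp_a = emulate_counts({"C-C": 3, "C-O": 1}, n_bits=1024)
fp_b = emulate_counts({"C-C": 2, "C-N": 1}, n_bits=1024)
similarity = tanimoto(fp_a, fp_b)
```

With this seeding, a feature with count 2 sets a subset of the bits set by the
same feature with count 3, which is what lets the binary fingerprint stand in
for counts.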

I tested my method against the exact solution. I found that numBitsPerFeature 
was not useful. That is, numBitsPerFeature=1 for a given N was essentially 
always better than numBitsPerFeature=2 for the same number of bits N.

I did find that numBitsPerFeature=2 for 2*N bits was slightly better than 
numBitsPerFeature=1 for N bits, but again numBitsPerFeature=1 for 2*N bits was 
still better than numBitsPerFeature=2 for 2*N bits.

(See my preprint at https://chemrxiv.org/doi/full/10.26434/chemrxiv-2026-j3hbj )

I tried to figure this out mathematically. My simple attempt says 
numBitsPerFeature shouldn't affect things at all. In short, if the original 
fingerprints have A and B features, with C features in common, and A and B are 
much less than N, then the number of bits set by k features is approximately 
f(k) = N(1-exp(-k/N)), i.e. a = N(1-exp(-A/N)) and b = N(1-exp(-B/N)). This 
formula is related to the Birthday Problem.
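
As a quick check of the approximation f(k) = N(1-exp(-k/N)), here is a small
sketch that compares it with a direct simulation of k uniform random draws into
N bits. The function names and the choice of k=100, N=1024 are only for
illustration.

```python
import math
import random


def expected_bits(k, n_bits):
    """Approximate number of distinct bits set by k random draws into n_bits."""
    return n_bits * (1.0 - math.exp(-k / n_bits))


def simulated_bits(k, n_bits, trials=2000, seed=0):
    """Average number of distinct bits set over repeated random trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(n_bits) for _ in range(k)})
    return total / trials


approx = expected_bits(100, 1024)
measured = simulated_bits(100, 1024)
```

The exact expectation is N(1-(1-1/N)^k), which the exponential form
approximates closely when N is large.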

If we assume C maps the same way, i.e. c = f(C), then the Tanimoto is

   T(fp_A, fp_B) = c / (a + b - c)

   T(fp_A, fp_B) = N(1-exp(-C/N)) /
      (N(1-exp(-A/N)) + N(1-exp(-B/N)) - N(1-exp(-C/N)))

If the number of bits per feature is doubled, and the fingerprint size N is 
also doubled, then the Tanimoto score is unchanged because the ratio 
(2*k)/(2*N) stays constant.
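
The invariance can be checked numerically. This sketch evaluates the estimate
above under the same simplifying assumption that c = f(C). The function names
and the example values A=60, B=80, C=40 are hypothetical.

```python
import math


def f(k, n_bits):
    """Approximate number of bits set by k random draws into n_bits."""
    return n_bits * (1.0 - math.exp(-k / n_bits))


def tanimoto_estimate(A, B, C, n_bits, bits_per_feature=1):
    """Estimated Tanimoto, assuming a, b and c are all given by f()."""
    a = f(bits_per_feature * A, n_bits)
    b = f(bits_per_feature * B, n_bits)
    c = f(bits_per_feature * C, n_bits)
    return c / (a + b - c)


# One bit per feature in N bits vs. two bits per feature in 2*N bits.
t_single = tanimoto_estimate(60, 80, 40, n_bits=1024, bits_per_feature=1)
t_double = tanimoto_estimate(60, 80, 40, n_bits=2048, bits_per_feature=2)
```

Doubling both k and N doubles a, b and c together, so the factor of 2 cancels
in the ratio.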

However, I don't think c (which is the number of bits in common as measured in 
the final fingerprints) is correctly computed as f(C) because of the higher 
chance of coincidental overlap with portions of (A-C) and (B-C). This analysis, 
alas, is beyond my mathematical abilities.
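
The coincidental overlap can at least be measured by simulation. The sketch
below uses an idealized model, not the method from the preprint: every feature
sets uniformly random bits, the C shared features set the same bits in both
fingerprints, and the (A-C) and (B-C) unshared features are drawn
independently. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_common_bits(A, B, C, n_bits, bits_per_feature=1, seed=0):
    """Return (bits set by shared features, bits common to both fingerprints)."""
    rng = random.Random(seed)

    def draw(num_features):
        draws = bits_per_feature * num_features
        return {rng.randrange(n_bits) for _ in range(draws)}

    shared = draw(C)        # bits from the C features present in both
    only_a = draw(A - C)    # bits from features found only in A
    only_b = draw(B - C)    # bits from features found only in B
    fp_a = shared | only_a
    fp_b = shared | only_b
    return len(shared), len(fp_a & fp_b)


def f(k, n_bits):
    """Approximate number of bits set by k random draws into n_bits."""
    return n_bits * (1.0 - math.exp(-k / n_bits))


# Average the measured number of common bits over several random seeds.
results = [simulate_common_bits(300, 400, 200, 1024, seed=s) for s in range(200)]
mean_common = sum(common for _, common in results) / len(results)
predicted = f(200, 1024)
```

In this model the common bits always include the bits from the shared features,
plus any positions where the unshared parts of A and B happen to collide, so
the measured c is never smaller than the shared-feature bit count.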

Still, my simulations suggest that setting more than one bit per feature isn't 
that useful.

I suspect this same conclusion would hold with the RDKit fingerprint generator, 
that is, I suspect numBitsPerFeature=1 would give slightly more accurate 
matches than numBitsPerFeature=2. Furthermore, it would improve the accuracy 
for the current default of 2048 bits, and the 1024-bit version would be almost 
as good as the current 2048 bits.

I'll add that the RDKit fingerprint generator is used for similarity, while the 
Daylight fingerprints were also used as substructure search screens. In the 
latter, the number of bits affects screenout, for information content reasons. 
I've been told the Daylight fingerprint set a different number of bits 
depending on the fingerprint length.


Best regards,

                                Andrew
                                [email protected]





_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
