Dear Jan,

You are probably right. If about 2/3 of your 10k bits are set to one, doesn't that imply that the probability of a collision for any new fragment is roughly 2/3 (which fits the 5 of 7 you observe in your example)?
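A quick back-of-the-envelope check, assuming the hash scatters genuinely new fragments uniformly over the 10k positions (numbers taken from your mail):

    from math import comb

    n_bits, n_on = 10_000, 6226
    p = n_on / n_bits                     # chance a new fragment lands on an occupied bit
    print(p)                              # 0.6226

    # probability that at least 5 of 7 new fragments collide
    p_5_of_7 = sum(comb(7, k) * p**k * (1 - p)**(7 - k) for k in range(5, 8))
    print(round(p_5_of_7, 2))             # ~0.47, so 5 of 7 is unremarkable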

Concerning your second question: just like any other descriptor, folded fingerprints certainly have their limitations (keep in mind they were originally developed for a different purpose: fast substructure and similarity searches). Your example nicely illustrates the need to use very robust ML techniques, or to conduct feature selection, with these kinds of descriptors. I would expect that, e.g., a random forest learner will detect such confounding effects if they are relevant and pick different bits. Most parts of a molecule are covered by multiple fragments, so there should be enough alternatives.
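A minimal sketch of what I mean, using scikit-learn and stand-in data (in practice X would be your folded fingerprints and y your labels):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 2048))   # stand-in for the fingerprint matrix
    y = rng.integers(0, 2, size=1000)           # stand-in for the property labels

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)

    # a collision-ridden bit should end up with low importance, as long as the
    # relevant substructure also sets other, cleaner bits
    top_bits = np.argsort(rf.feature_importances_)[::-1][:20]
    print(top_bits)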

Hope this helps,
Nils

P.S. In your mail you talk about ECFP4s, but in the code it looks like you are using a radius of 2. If I remember correctly, though, the "4" in ECFP4 refers to the diameter, so radius 2 should be exactly the matching setting in RDKit - either way, I doubt it would make a difference for the question at hand.
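For reference, the call I mean (with an arbitrary example molecule):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("c1ccccc1O")
    # RDKit's radius counts bonds from the central atom; ECFP4's "4" counts the
    # diameter, so radius=2 gives the ECFP4-style fingerprint
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=10_000)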

On 09.01.2021 at 10:41, Jan Halborg Jensen wrote:
I am trying to relate the reliability of ML models trained on binary fingerprints to the presence of on-bits, i.e. comparing the on-bits of a test-set molecule to the on-bits in the training set. But I am getting some strange results.

The code is here, so I will just summarise: https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing

I pick 1000 random molecules from ZINC as my "training set" and compute ECFP4 fingerprints using nBits=10_000; there are 6226 unique on-bits. I use nBits=10_000 to try to avoid collisions.
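In condensed form, that step looks like this (with a short stand-in list in place of the 1000 ZINC SMILES):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    train_smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]   # stand-in for the ZINC sample

    train_bits = set()
    for smi in train_smiles:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=10_000)
        train_bits.update(fp.GetOnBits())

    print(len(train_bits))   # 6226 for the full 1000-molecule sample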

Then I compute the on-bits for a molecule ("test set") that is very different from anything in the "training set" (OOOOO) and compare them to the 6226 unique "training set" on-bits. Of the 7 on-bits for OOOOO, five are found among the 6226: 656, 2311, 4453, 4487, 8550.

However, 656, 4453, and 8550 correspond to different fragments in the "training set".
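The fragments behind each bit can be compared by passing a bitInfo dict, roughly like this:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("OOOOO")
    info = {}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=10_000, bitInfo=info)

    # info maps bit -> [(center atom index, radius), ...]; comparing these
    # environments across molecules is what exposes the collisions
    for bit, envs in sorted(info.items()):
        print(bit, envs)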

The only reason I can think of is bit-collisions in the hashing, but there are 10000 - 6226 = 3774 unused positions.
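For scale, assuming the hash is uniform: if m distinct fragments fall into n = 10000 buckets, the expected number of occupied buckets is n*(1 - (1 - 1/n)**m), and inverting that for the observed 6226 on-bits suggests thousands of collisions already happened despite the unused positions:

    from math import log

    n, on = 10_000, 6226
    m = -n * log(1 - on / n)   # inverts E[occupied] ~ n * (1 - exp(-m/n))
    print(round(m))            # ~9745 fragments needed to occupy 6226 distinct bits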

Is there any other explanation? If not, what does that say about using bit vectors (especially the usual nBits = 2048) as descriptors?

Any insight is greatly appreciated.

Best regards, Jan


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



