Dear Jan,
You are probably right. If you have about 2/3 of your 10k bits set to
one, doesn't that imply that the probability of a collision for any new
fragment is roughly 2/3 (which matches the 5 of 7 you observe in your
example)?
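This intuition is easy to check with a toy simulation (plain Python, no RDKit; the bit count and on-bit count are taken from the example below, the uniform-hashing assumption is mine):

```python
import random

# Toy sketch: with k of n fingerprint bits already set, a new fragment
# hashed to a uniformly random position collides with probability k/n,
# regardless of how many positions remain unused.
random.seed(0)
n_bits = 10_000
on_bits = set(random.sample(range(n_bits), 6226))  # ~2/3 of bits set

trials = 100_000
hits = sum(random.randrange(n_bits) in on_bits for _ in range(trials))
print(hits / trials)  # close to 6226/10000 = 0.6226
```

Note that the collision rate tracks the *fraction* of set bits, which is why 3774 free positions do not help much.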
Concerning your second question: just as with any other descriptor, folded
fingerprints certainly have their limitations (keep in mind that they were
originally developed for a different purpose: fast substructure and
similarity searches). Your example nicely illustrates the need to use
very robust ML techniques, or to conduct feature selection, for these
kinds of descriptors. I would expect that a learner such as a random
forest will detect such confounding effects, if they are relevant, and
pick different bits. Most parts of a molecule are covered by multiple
fragments, so there should be enough alternatives.
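One crude form of such feature selection can be sketched in plain Python (no RDKit; the fragment names and the md5-based folding are stand-ins for the real Morgan hashes): flag bits that more than one distinct fragment folds onto, since those are the ambiguous features a learner has to work around.

```python
import hashlib
from collections import defaultdict

# Hypothetical sketch: fold fragment identifiers into a small bit space
# and flag "ambiguous" bits hit by more than one distinct fragment.
n_bits = 64
fragments = [f"frag_{i}" for i in range(100)]  # invented stand-in names

bit_to_frags = defaultdict(set)
for frag in fragments:
    # md5 stands in for the fingerprint's internal hash; % n_bits is the fold
    bit = int(hashlib.md5(frag.encode()).hexdigest(), 16) % n_bits
    bit_to_frags[bit].add(frag)

ambiguous = {b for b, frs in bit_to_frags.items() if len(frs) > 1}
clean = set(bit_to_frags) - ambiguous
print(len(ambiguous), len(clean))
```

With 100 fragments folded into 64 bits, collisions are guaranteed by the pigeonhole principle; dropping or down-weighting the ambiguous bits is one simple way to limit the confounding effect.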
Hope this helps,
Nils
P.S. In your mail, you talk about ECFP4s, but in the code, it looks like
you are using a radius of 2 - or am I wrong? I doubt it will make a
difference in general.
On 09.01.2021 at 10:41, Jan Halborg Jensen wrote:
I am trying to relate the reliability of ML models trained using binary
fingerprints to the presence of on-bits, i.e. comparing the on-bits of a
molecule in the test set to the on-bits in the training set. But I am
getting some strange results.
The code is here, so I will just summarise:
https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing
I pick 1000 random molecules from ZINC as my "training set" and compute
ECFP4 fingerprints using nBits=10_000. There are 6226 unique on-bits. I
use nBits=10_000 to try to avoid collisions.
Then I compute the on-bits for a molecule ("test set") that is very
different from any in the "training set" (OOOOO) and compare them to the
6226 unique on-bits from ZINC. Of the 7 on-bits for OOOOO shown here,
five are found among the 6226 "training set" on-bits: 656, 2311, 4453,
4487, 8550.
However, 656, 4453, and 8550 correspond to different fragments in the
"training set".
The only explanation I can think of is bit collisions in the hashing,
but there are 10000 - 6226 = 3774 unused positions.
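A quick back-of-the-envelope check of these numbers, assuming each new on-bit hashes uniformly over the 10,000 positions (an assumption, not something the code above verifies):

```python
# If each of the 7 new on-bits hashes uniformly, the expected number
# landing on already-set training-set bits is 7 * (fraction of set bits).
n_bits, n_on = 10_000, 6226
p = n_on / n_bits              # chance any one bit is already set
expected_matches = 7 * p
print(round(expected_matches, 2))  # ~4.36, close to the 5 observed
```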
Is there any other explanation? If not, what does that say about using
bit vectors (especially the usual nBits = 2048) as descriptors?
Any insight is greatly appreciated.
Best regards, Jan
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss