Dear Jan,
You are probably right. If you have about 2/3 of your 10k bits set to
one, doesn't that imply that the probability of a collision for any new
fragment is roughly 2/3 (which matches the 5 of 7 you observe in your
example)?
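This intuition is easy to check with a toy simulation (plain Python, no RDKit; the bit count and on-bit count are taken from the example below, the uniform-hashing assumption is mine):

```python
import random

# Toy sketch: with k of n fingerprint bits already set, a new fragment
# hashed to a uniformly random position collides with probability k/n,
# regardless of how many positions remain unused.
random.seed(0)
n_bits = 10_000
on_bits = set(random.sample(range(n_bits), 6226))  # ~2/3 of bits set

trials = 100_000
hits = sum(random.randrange(n_bits) in on_bits for _ in range(trials))
print(hits / trials)  # close to 6226/10000 = 0.6226
```

Note that the collision rate tracks the *fraction* of set bits, which is why 3774 free positions do not help much.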
Concerning your second question: just as with any other descriptor, folded
fingerprints certainly have their limitations (keep in mind that they were
originally developed for a different purpose: fast substructure and
similarity searches). Your example nicely illustrates the need to use
very robust ML techniques, or to conduct feature selection, for these
kinds of descriptors. I would expect that a learner such as a random
forest will detect such confounding effects, if they are relevant, and
pick different bits. Most parts of a molecule are covered by multiple
fragments, so there should be enough alternatives.
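One crude form of such feature selection can be sketched in plain Python (no RDKit; the fragment names and the md5-based folding are stand-ins for the real Morgan hashes): flag bits that more than one distinct fragment folds onto, since those are the ambiguous features a learner has to work around.

```python
import hashlib
from collections import defaultdict

# Hypothetical sketch: fold fragment identifiers into a small bit space
# and flag "ambiguous" bits hit by more than one distinct fragment.
n_bits = 64
fragments = [f"frag_{i}" for i in range(100)]  # invented stand-in names

bit_to_frags = defaultdict(set)
for frag in fragments:
    # md5 stands in for the fingerprint's internal hash; % n_bits is the fold
    bit = int(hashlib.md5(frag.encode()).hexdigest(), 16) % n_bits
    bit_to_frags[bit].add(frag)

ambiguous = {b for b, frs in bit_to_frags.items() if len(frs) > 1}
clean = set(bit_to_frags) - ambiguous
print(len(ambiguous), len(clean))
```

With 100 fragments folded into 64 bits, collisions are guaranteed by the pigeonhole principle; dropping or down-weighting the ambiguous bits is one simple way to limit the confounding effect.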
Hope this helps,
Nils
P.S. In your mail, you talk about ECFP4s, but in the code, it looks like
you are using a radius of 2 - or am I wrong? I doubt it will make a
difference in general.
On 09.01.2021 at 10:41, Jan Halborg Jensen wrote:
I am trying to relate the reliability of ML models trained using binary
fingerprints to the presence of on-bits, i.e. comparing the on-bits of a
molecule in the test set to the on-bits in the training set. But I am
getting some strange results.
The code is here, so I will just summarise:
https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing
I pick 1000 random molecules from ZINC as my "training set" and compute
ECFP4 fingerprints using nBits=10_000. There are 6226 unique on-bits. I
use nBits=10_000 to try to avoid collisions.
Then I compute the on-bits for a molecule ("test set") that is very
different from any in the "training set" (OOOOO) and compare them to the
6226 unique on-bits from ZINC. Of the 7 on-bits for OOOOO shown here,
five are found among the 6226 "training set" on-bits: 656, 2311, 4453,
4487, 8550.
However, 656, 4453, and 8550 correspond to different fragments in the
"training set".
The only explanation I can think of is bit collisions in the hashing,
but there are 10000 - 6226 = 3774 unused positions.
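A quick back-of-the-envelope check of these numbers, assuming each new on-bit hashes uniformly over the 10,000 positions (an assumption, not something the code above verifies):

```python
# If each of the 7 new on-bits hashes uniformly, the expected number
# landing on already-set training-set bits is 7 * (fraction of set bits).
n_bits, n_on = 10_000, 6226
p = n_on / n_bits              # chance any one bit is already set
expected_matches = 7 * p
print(round(expected_matches, 2))  # ~4.36, close to the 5 observed
```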
Is there any other explanation? If not, what does that say about using
bit vectors (especially the usual nBits = 2048) as descriptors?
Any insight is greatly appreciated.
Best regards, Jan
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss