[Rdkit-discuss] hash function used in the generation of morgan fingerprint

Wendong Wang Wed, 01 Mar 2023 17:42:03 -0800

Dear colleague,
While learning about the Morgan fingerprint, I ran into the following two 
questions. I have read the standard reference paper by Rogers, D. & Hahn, M., 
J. Chem. Inf. Model. 50, 742–754 (2010).


1) What is the nature of the hashing function, and which hash function does 
RDKit use for the Morgan fingerprint? A sentence in the paper says "What is most 
important is to have the hash function map arrays of integers randomly and 
uniformly into the 2^32-size space of all possible integers". If I understand 
hashing correctly, it is a deterministic process, in the sense that as long as 
the input arrays are the same, the output integers will be the same. In other 
words, no random variables are involved in generating the integer, so the word 
"randomly" in that sentence appears to be a bit confusing. Is my understanding 
correct?
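
To make the determinism point concrete, here is a minimal sketch of hashing an array of integers into the 2^32 space. The mixing step is modeled on the common boost-style hash_combine recipe; that choice, and the names hash_combine and hash_ints, are assumptions for illustration only and not necessarily what RDKit does internally.

```python
MASK32 = 0xFFFFFFFF  # keep every result inside the 2^32 space

def hash_combine(seed: int, value: int) -> int:
    """Mix one integer into a running 32-bit seed (illustrative only)."""
    mixed = value + 0x9E3779B9 + (seed << 6) + (seed >> 2)
    return (seed ^ mixed) & MASK32

def hash_ints(values) -> int:
    """Hash a sequence of integers to a single 32-bit identifier."""
    seed = 0
    for v in values:
        seed = hash_combine(seed, v)
    return seed

a = hash_ints([1, 2, 3])
b = hash_ints([1, 2, 3])
c = hash_ints([1, 2, 4])
print(a == b)  # identical input arrays always give the same identifier
print(a != c)  # changing the last element changes the identifier
```

Nothing here draws a random number. On this reading, "randomly" would describe how the outputs are scattered over the 32-bit range, not how any single output is produced.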

2) Converting from the integer fingerprint (obtained using 
AllChem.GetMorganFingerprint) to the explicit bit vector (obtained using 
AllChem.GetMorganFingerprintAsBitVect) appears to be a simple mod operation. 
This could also lead to duplicates, i.e. distinct integer identifiers mapping 
to the same bit. Is the reason for this conversion to make Morgan fingerprints 
usable as a fixed-length input for machine learning applications?
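
The folding I have in mind can be sketched as follows, assuming the conversion really is a plain modulo by the bit-vector length. The feature identifiers are made up for illustration, and fold is a hypothetical helper, not an RDKit function.

```python
def fold(sparse_counts: dict, n_bits: int = 2048) -> set:
    """Return the set of on-bit positions after folding by modulo."""
    return {feature_id % n_bits for feature_id in sparse_counts}

# Hypothetical feature identifiers -> counts, made up for illustration
sparse = {5: 1, 2053: 2, 4100: 1}

on_bits = fold(sparse, n_bits=2048)
print(sorted(on_bits))
print(len(sparse), "features ->", len(on_bits), "bits")
```

Here 5 and 2053 differ by exactly 2048, so they land on the same bit. The folded vector has a fixed length, but it can no longer tell those two features apart.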

Thank you for your help in advance.

Best wishes,
Wendong


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
