Dear colleague,

While learning about the Morgan fingerprint, I ran into the following two questions. I have read the standard reference paper by Rogers, D. & Hahn, M. in J. Chem. Inf. Model. 50, 742–754 (2010).

1) What is the nature of the hashing function, and what does RDKit use as the hash function for the Morgan fingerprint? A sentence in the paper says: "What is most important is to have the hash function map arrays of integers randomly and uniformly into the 2^32-size space of all possible integers". If I understand hashing correctly, it is a deterministic process, in the sense that the same input array always yields the same output integer. No random variables are involved in generating the integer, so the word "randomly" in that sentence is a bit confusing. Is my understanding correct?

2) Converting from the integer fingerprint (obtained using AllChem.GetMorganFingerprint) to the explicit bit vector (obtained using AllChem.GetMorganFingerprintAsBitVect) appears to be a simple mod operation. This could also lead to duplicates, i.e. distinct integer identifiers mapping to the same bit. Is the purpose of this conversion to make Morgan fingerprints usable as a fixed-length input for machine learning applications?

Thank you in advance for your help.

Best wishes,
Wendong
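
P.S. To make the two questions concrete, here is a minimal sketch of how I currently picture things. It does not use RDKit and is not meant to reproduce its implementation: `toy_hash` and `fold` are hypothetical helpers, `zlib.crc32` is only a stand-in for whatever hash is actually used, and `n_bits=2048` is an arbitrary choice. The first part shows that a hash is deterministic (same array in, same integer out); the second shows how a mod operation could fold 32-bit identifiers into a fixed-length vector and why two distinct identifiers can end up on the same bit.

```python
import zlib


def toy_hash(values):
    """Deterministic stand-in hash: maps a sequence of ints into [0, 2**32).

    Assumption: CRC32 is used purely for illustration; it is NOT claimed
    to be the hash function that RDKit or the ECFP paper uses.
    """
    data = b"".join(v.to_bytes(8, "little", signed=True) for v in values)
    return zlib.crc32(data)  # unsigned 32-bit integer in Python 3


def fold(identifiers, n_bits=2048):
    """Fold integer identifiers into a fixed-length 0/1 list via modulo."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1  # distinct identifiers may share a bit
    return bits


# Question 1: the same input array always gives the same integer.
same_twice = toy_hash((6, 1, 8)) == toy_hash((6, 1, 8))
print(same_twice)

# Question 2: two different identifiers whose difference is a multiple
# of n_bits collide after folding, so only one bit is set.
a = 12345
b = a + 3 * 2048
folded = fold([a, b])
print(a % 2048, b % 2048, sum(folded))
```

If this picture is right, "randomly" in the paper would describe how well the outputs are spread over the integer space, not any randomness in a single evaluation, and the folded vector trades some collisions for a fixed length.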