Hi all,

I felt sure this would have been asked already, but I can't find it. What is the 'largest' possible bit in an unfolded Morgan fingerprint? Asked another way, what type of number are the substructure identifiers hashed into?
The Rogers and Hahn ECFP paper says that they hash into a 32-bit integer, and the paper shows both negative and positive values. Since hashing should set bits with roughly uniform density across the available range, I tried sampling some fingerprints. Across a few hundred thousand molecules, the largest bit I found was suspiciously close to twice the maximum value of a signed 32-bit integer, i.e. close to 2^32. So my guess is that, to be consistent with Rogers and Hahn, the bits are hashed into 32-bit integers but then shifted to be positive. Is that correct?

Thanks :) Hope the UGM went well.

Lewis

PS: For context, I saw a weird result where prediction scores kept getting higher as I used larger fingerprints. At size 8192 I ran out of memory, so I'm moving to a sparse representation (possibly unfolded), but I don't know how big the sparse matrix should be.
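To make the guess above concrete, here is a minimal sketch of the arithmetic only. It does not call RDKit, and it assumes (this is exactly the part I'm unsure of) that the identifiers are signed 32-bit values whose bit pattern is reinterpreted as unsigned; a plain offset of 2^31 would give a different mapping but the same overall range. The helper names are made up for illustration.

```python
import struct

# Limits of the two 32-bit integer interpretations.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1


def as_unsigned32(signed_value: int) -> int:
    """Reinterpret the 32 bits of a signed integer as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<i", signed_value))[0]


def as_signed32(unsigned_value: int) -> int:
    """Inverse of as_unsigned32: reinterpret unsigned 32 bits as signed."""
    return struct.unpack("<i", struct.pack("<I", unsigned_value))[0]


# Hypothetical signed identifiers spanning the full signed range
# (illustrative values, not real fingerprint output).
examples = [INT32_MIN, -1, 0, 1, INT32_MAX]
unsigned = [as_unsigned32(v) for v in examples]

# Under this assumption the largest possible bit index is 2**32 - 1,
# which is roughly twice the signed maximum of 2**31 - 1.
ratio = UINT32_MAX / INT32_MAX

# A sparse matrix indexed directly by bit would then need 2**32 columns.
SPARSE_WIDTH = UINT32_MAX + 1

if __name__ == "__main__":
    for s, u in zip(examples, unsigned):
        print(f"{s:>12d} -> {u:>10d}")
    print("largest possible bit:", max(unsigned))
    print("ratio to signed max:", ratio)
    print("sparse matrix width:", SPARSE_WIDTH)
```

If that assumption holds, a largest observed bit just under 2^32 is what uniform hashing would produce, and 2^32 would be a safe column count for the sparse matrix.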
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss