Hi Lewis,

The Morgan atom environments are hashed into an unsigned 32-bit int, so the maximum value is 2^32 - 1.
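To make the ranges concrete, here is a rough sketch in plain Python (it is not RDKit code and does not call RDKit). It shows why a maximum near 2^32 - 1 looks like about twice the largest signed 32-bit value, and one possible way to keep a sparse matrix small by giving each observed bit ID its own column. The helper names and the example bit IDs are made up for illustration, and the two's-complement reinterpretation only illustrates the two ranges; it is not a claim about how RDKit's hashes relate to the values in the ECFP paper.

```python
# Ranges of 32-bit integers, using only the standard library.
INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value


def as_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


def compact_columns(fingerprints):
    """Assign a small column index to every bit ID that actually occurs.

    `fingerprints` is a hypothetical list of dicts {bit_id: count},
    one dict per molecule. Returns {bit_id: column_index}.
    """
    columns = {}
    for fp in fingerprints:
        for bit_id in fp:
            # First time a bit ID is seen, it gets the next free column.
            columns.setdefault(bit_id, len(columns))
    return columns


# Made-up bit IDs, for illustration only.
example = [
    {98513984: 1, 4294967295: 2},
    {98513984: 1, 2246728737: 1},
]
columns = compact_columns(example)

print(UINT32_MAX)              # upper bound on an unfolded bit ID
print(UINT32_MAX / INT32_MAX)  # ratio of unsigned to signed maximum
print(len(columns))            # number of columns actually needed
```

With a mapping like this, the sparse matrix only needs as many columns as there are distinct bits in the data set, rather than 2^32 of them.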
-greg

On Fri, Oct 9, 2020 at 1:18 AM Lewis Martin <lewis.marti...@gmail.com> wrote:

> Hi all,
> Felt sure this would have been asked but I can't find it. What is the
> 'largest' possible bit in an unfolded Morgan fingerprint? Asked another
> way, what type of number are the substructure identities hashed into?
>
> The Rogers and Hahn ECFP paper says that they hash into a 32-bit integer,
> and in the paper they use negative and positive values.
>
> Since hashing generates bits with mostly uniform density, I tried sampling
> some fingerprints. Testing a few hundred thousand molecules, the largest
> bit I found was suspiciously close to 2X larger than the maximum
> expressible number for a 32-bit integer. So I guess that, to be consistent
> with Rogers and Hahn the bits are hashed into 32-bit integers, but then
> they are shifted to be positive? Is that correct?
>
> Thanks :) hope the UGM went well.
> Lewis
>
> PS context is I saw a weird result where prediction scores kept getting
> higher when I used larger fingerprints. At size 8192 I ran out of memory,
> so I'm moving to sparse representation (possibly unfolded) but I don't know
> how big the sparse matrix should be.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss