Hi all,
I felt sure this would have been asked before, but I can't find it. What
is the 'largest' possible bit ID in an unfolded Morgan fingerprint? Put
another way, what type of number are the substructure identifiers hashed
into?

The Rogers and Hahn ECFP paper says that they hash into a 32-bit integer,
and in the paper they use negative and positive values.

Since hashing should spread bit IDs roughly uniformly over the available
range, I tried sampling some fingerprints. Across a few hundred thousand
molecules, the largest bit ID I found was suspiciously close to twice the
maximum value of a signed 32-bit integer (2^31 - 1), i.e. just under 2^32.
So my guess is that, to be consistent with Rogers and Hahn, the bits are
hashed into 32-bit integers but are then shifted to be positive. Is that
correct?
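
To make the arithmetic behind that guess concrete, here is a small sketch in
plain Python (no RDKit involved). It only compares the signed and unsigned
32-bit ranges and shows one way a signed value could map onto a positive one,
namely a two's-complement reinterpretation. That mapping is my assumption for
illustration; I am not claiming it is what RDKit does internally, and the
helper names are made up.

```python
# Range arithmetic for 32-bit hash values (illustrative only).
INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value


def to_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit integer as signed (two's complement)."""
    return value - 2**32 if value > INT32_MAX else value


# The unsigned maximum is just over twice the signed maximum, which is
# the "2X" pattern described above.
ratio = UINT32_MAX / INT32_MAX

print(INT32_MAX, UINT32_MAX, ratio)
print(to_unsigned32(-1))          # a negative hash value seen as unsigned
print(to_signed32(UINT32_MAX))    # and back again
```

If the identifiers really are unsigned 32-bit values, a sample of a few
hundred thousand molecules would be expected to have a maximum close to, but
below, 2^32.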

Thanks :) hope the UGM went well.
Lewis

PS: for context, I saw an odd result where prediction scores kept improving
as I increased the fingerprint size. At 8192 bits I ran out of memory, so
I'm moving to a sparse representation (possibly unfolded), but I don't know
how many columns the sparse matrix needs.
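
One possible way around the matrix-size question, sketched below in plain
Python: instead of using the raw bit ID as the column index, remap each
distinct bit ID seen in the data set to a compact column number. The matrix
then needs only as many columns as there are distinct bits, whatever the hash
range turns out to be. This assumes each fingerprint is available as a dict
of {bit ID: count}; the function name and the example bit IDs are
hypothetical.

```python
def build_sparse_rows(fingerprints):
    """Remap raw bit IDs to compact column indices.

    fingerprints: list of {bit_id: count} dicts, one per molecule
                  (an assumed input format).
    Returns (rows, column_of): rows is a list of {column: count} dicts,
    and column_of maps each observed bit_id to its column index.
    """
    column_of = {}
    rows = []
    for fp in fingerprints:
        row = {}
        for bit_id, count in fp.items():
            # Assign the next free column the first time a bit ID is seen.
            col = column_of.setdefault(bit_id, len(column_of))
            row[col] = count
        rows.append(row)
    return rows, column_of


# Made-up bit IDs, purely for illustration.
fps = [
    {98513984: 1, 4294901760: 2},
    {98513984: 1, 12345: 3},
]
rows, column_of = build_sparse_rows(fps)
n_columns = len(column_of)
print(rows, n_columns)
```

The same column_of mapping has to be reused at prediction time, and bit IDs
that were never seen in training have no column, so they would need to be
dropped or handled separately.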
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
