On Apr 22, 2018, at 20:22, Nils Weskamp <nils.wesk...@gmail.com> wrote:
> Actually, I *was* also thinking about your use cases 2 and 3 since you
> also need some form of hash function to map substructures to bit
> numbers. This is normally a rather simple function / pseudo random
> generator,

Strictly speaking, this is not a requirement. The term "fingerprint" has taken on quite an encompassing meaning since 1990.

The molecular formula is a count fingerprint with 118 keys, based on the atomic number. There's no need for a hash function there. "CCO" might be:

  [0, 0, 0, 0, 0, 2, 0, 1, ...]

Or it can be written in a more compact form like {"C": 2, "O": 1}.

As an alternative, I could use a mapping from canonical substructures to counts, so "CCO" becomes:

  {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1}

This doesn't require a hash. (While I represent that as a Python dictionary, which uses a hash table underneath, it could be implemented using a red-black tree or B-tree, or with a simple linear search.)

It's only if I want to convert this into a fixed-length representation that I have to figure out some sort of encoding scheme. Even then, I don't need a PRNG or hash seed.

Suppose I use a bit vector. I could have a table which maps each canonical substructure to its bit pattern. If I have an unknown fragment, I could use RANDOM.ORG to get the bits. Downsides include potentially unbounded table growth and the need for a centralized table.

This is the approach that Zatocoding used, and I see Chemical Zatocoding as the only precursor to Daylight hash fingerprints.

> which could of course also be changed to something expensive to calculate.

Yes, that could be possible. Abstractly, let the first 20 bytes of each fingerprint be a salt, and use something like bcrypt so each fingerprint test requires that the query structure be re-fingerprinted for the per-fingerprint hash function.

It would, however, take an absurdly long time to do a similarity search.

And in any case, before going further along that path, we would need to figure out the risk model. Brian started by saying that he wanted to obfuscate molecules for security, but didn't say what he wants to use them for, and whether he wants to secure them against nation-states, or simply against me.
;)

				Andrew
				da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
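
As a rough illustration of the two hash-free ideas in the message above (a count fingerprint keyed by canonical substructure, and a central table that hands out random bit patterns), here is a minimal Python sketch. It is not code from the thread. The fragment list is written out by hand rather than enumerated by a toolkit, and the names `count_fingerprint` and `BitTable`, the 64-bit width, and the two bits per fragment are all assumptions made for the example. The operating system's random source stands in for RANDOM.ORG.

```python
from collections import Counter
import secrets

# Hypothetical input: canonical fragments of "CCO", written out by hand.
# A real pipeline would enumerate and canonicalize them with a toolkit.
ETHANOL_FRAGMENTS = ["C", "C", "O", "CC", "CO", "CCO"]


def count_fingerprint(fragments):
    """Map each canonical fragment string to the number of times it occurs."""
    return dict(Counter(fragments))


class BitTable:
    """Central table assigning each fragment a fixed, randomly drawn bit pattern.

    The bits are not computed from the fragment. They are drawn once from a
    random source and remembered, so the table grows with every new fragment.
    """

    def __init__(self, num_bits=64, bits_per_fragment=2):
        self.num_bits = num_bits
        self.bits_per_fragment = bits_per_fragment
        self.table = {}
        # Assumption: OS randomness as a stand-in for an external random source.
        self._rng = secrets.SystemRandom()

    def bits_for(self, fragment):
        """Return the stored bit positions, drawing new ones for an unseen fragment."""
        if fragment not in self.table:
            positions = self._rng.sample(range(self.num_bits), self.bits_per_fragment)
            self.table[fragment] = tuple(sorted(positions))
        return self.table[fragment]

    def encode(self, fragments):
        """Superimpose the bit patterns of all distinct fragments into one integer."""
        fingerprint = 0
        for fragment in set(fragments):
            for bit in self.bits_for(fragment):
                fingerprint |= 1 << bit
        return fingerprint


counts = count_fingerprint(ETHANOL_FRAGMENTS)
table = BitTable()
ethanol_bits = table.encode(ETHANOL_FRAGMENTS)
methanol_bits = table.encode(["C", "O", "CO"])
```

Because the table remembers every assignment, encoding the same fragments twice gives the same bits, and a molecule whose fragments are a subset of another's sets only bits that the larger fingerprint also sets. The cost is the one named above: every party has to share the same, ever-growing table.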