On Apr 22, 2018, at 20:22, Nils Weskamp <nils.wesk...@gmail.com> wrote:
> Actually, I *was* also thinking about your use cases 2 and 3 since you
> also need some form of hash function to map substructures to bit
> numbers. This is normally a rather simple function / pseudo random
> generator, 

Strictly speaking, this is not a requirement.

The term "fingerprint" has taken on quite an encompassing meaning since 1990.

The molecular formula is a count fingerprint with 118 keys, based on the atomic 
number. There's no need for a hash function there. Counting only the heavy 
atoms, "CCO" might be:
  [0, 0, 0, 0, 0, 2, 0, 1, ...]

Or it can be written in more compact form like {"C": 2, "O": 1}.
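
A minimal Python sketch of both forms, assuming the atoms are already 
available as a list of atomic numbers (the function name and the constant are 
made up for illustration):

```python
from collections import Counter

NUM_ELEMENTS = 118  # one key per atomic number, 1..118

def formula_fingerprint(atomic_numbers):
    """Dense count fingerprint: position Z-1 holds the count of atomic number Z."""
    counts = [0] * NUM_ELEMENTS
    for z in atomic_numbers:
        counts[z - 1] += 1
    return counts

# Heavy atoms of "CCO": two carbons (Z=6) and one oxygen (Z=8).
cco = [6, 6, 8]
dense = formula_fingerprint(cco)
sparse = Counter(cco)  # compact form, keyed by atomic number

print(dense[:8])  # [0, 0, 0, 0, 0, 2, 0, 1]
```

The position of each count is fixed by the atomic number, so nothing needs to 
be hashed to decide where a count goes.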

As an alternative, I could use a mapping from canonical substructures to 
counts, so "CCO" becomes:

  {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1}

This doesn't require a hash. (While I represent that as a Python dictionary, 
which uses a hash table underneath, it could be implemented using a red-black 
tree or B-tree, or with a simple linear search.)
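
To make the "no hash" point concrete, here is a rough sketch that keeps the 
keys in a sorted list and finds them by binary search. It only handles an 
unbranched chain of atom symbols, and the canonicalization rule (take the 
smaller of the forward and reversed spelling) is my simplification for this 
example, not a general canonical SMILES algorithm. All names are hypothetical.

```python
import bisect

def canonical(path):
    """Pick one spelling for a linear path: the smaller of forward and reversed."""
    forward = "".join(path)
    backward = "".join(reversed(path))
    return min(forward, backward)

def count_linear_paths(atoms):
    """Count canonical contiguous paths of an unbranched chain of atom symbols.

    Keys live in a sorted list and are located by binary search, so the
    lookup relies only on string comparison, never on a hash function.
    """
    keys, counts = [], []
    n = len(atoms)
    for start in range(n):
        for end in range(start + 1, n + 1):
            key = canonical(atoms[start:end])
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                counts[i] += 1
            else:
                keys.insert(i, key)
                counts.insert(i, 1)
    return list(zip(keys, counts))

pairs = count_linear_paths(["C", "C", "O"])
print(pairs)  # [('C', 2), ('CC', 1), ('CCO', 1), ('CO', 1), ('O', 1)]
```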

It's only when I want to convert this into a fixed-length representation that 
I have to figure out some sort of encoding scheme.

Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I 
could have a table which maps each canonical substructure to its bit pattern. 
If I have an unknown fragment, I could use RANDOM.ORG to get the bits.

Downsides include potentially unbounded table growth and the need for a 
centralized table.

This is the approach that Zatocoding used, and I see Chemical Zatocoding as the 
only precursor to Daylight hash fingerprints.
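
A sketch of such a table-driven scheme, under several assumptions of mine: 
1024 bits, 3 bits per fragment, and a pluggable pick_bit callable standing in 
for whatever supplies the random bits. The default uses Python's secrets 
module as a local stand-in for RANDOM.ORG. The class and function names are 
hypothetical.

```python
import secrets

NUM_BITS = 1024          # assumed fingerprint length
BITS_PER_FRAGMENT = 3    # assumed number of bits set per fragment

class FragmentTable:
    """Central table assigning each canonical fragment a fixed bit pattern."""

    def __init__(self, pick_bit=None):
        # pick_bit() returns one bit position; stand-in for RANDOM.ORG.
        self._pick_bit = pick_bit or (lambda: secrets.randbelow(NUM_BITS))
        self._entries = []  # (fragment, bits) pairs, searched linearly

    def bits_for(self, fragment):
        for known, bits in self._entries:
            if known == fragment:
                return bits
        # Unknown fragment: draw distinct bit positions and remember them.
        chosen = []
        while len(chosen) < BITS_PER_FRAGMENT:
            bit = self._pick_bit()
            if bit not in chosen:
                chosen.append(bit)
        bits = tuple(sorted(chosen))
        self._entries.append((fragment, bits))
        return bits

def fingerprint(fragments, table):
    """OR together the bit patterns of all fragments into one integer."""
    fp = 0
    for fragment in fragments:
        for bit in table.bits_for(fragment):
            fp |= 1 << bit
    return fp

table = FragmentTable()
fp = fingerprint(["C", "O", "CC", "CO", "CCO"], table)
```

The pattern for a fragment depends only on what the table recorded, not on 
the fragment's text, which is why everyone who wants comparable fingerprints 
has to share the same table, and why the table grows with every new fragment.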

>  which could of course also be changed to something expensive to calculate.


Yes, that could be possible. Abstractly, let the first 20 bytes of each 
fingerprint be a salt, and use something like bcrypt so each fingerprint test 
requires that the query structure be re-fingerprinted for the per-fingerprint 
hash function.

It would, however, take an absurdly long time to do a similarity search.
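
A rough sketch of the idea, with loud caveats: I use PBKDF2 from Python's 
hashlib in place of bcrypt so the example needs only the standard library, 
and the bit count, iteration count, one-bit-per-fragment rule, and all 
function names are assumptions for illustration.

```python
import hashlib
import os

SALT_LEN = 20      # first 20 bytes of each fingerprint hold the salt
NUM_BITS = 1024    # assumed fingerprint length
ITERATIONS = 1000  # kept low for a demo; a real deployment would go far higher

def slow_bit(fragment, salt):
    """Map a fragment to a bit position using a deliberately slow, salted hash."""
    digest = hashlib.pbkdf2_hmac("sha256", fragment.encode("utf-8"), salt, ITERATIONS)
    return int.from_bytes(digest[:4], "big") % NUM_BITS

def make_fingerprint(fragments, salt=None):
    """Return salt + bit vector; a fresh random salt is drawn unless one is given."""
    if salt is None:
        salt = os.urandom(SALT_LEN)
    bits = bytearray(NUM_BITS // 8)
    for fragment in fragments:
        b = slow_bit(fragment, salt)
        bits[b // 8] |= 1 << (b % 8)
    return salt + bytes(bits)

def tanimoto_against(query_fragments, target_fp):
    """Re-fingerprint the query with the target's salt, then compare bits."""
    salt = target_fp[:SALT_LEN]
    query_fp = make_fingerprint(query_fragments, salt)
    a = int.from_bytes(query_fp[SALT_LEN:], "big")
    b = int.from_bytes(target_fp[SALT_LEN:], "big")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

target = make_fingerprint(["C", "O", "CC", "CO", "CCO"])
score = tanimoto_against(["C", "O", "CC", "CO", "CCO"], target)
```

Every comparison repeats the slow hash for every query fragment, because each 
target carries its own salt. That per-target cost is what makes a search over 
a large collection so slow.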

And in any case, before going further along that path, we would need to figure 
out the risk model. Brian started by saying that he wanted to obfuscate 
molecules for security, but didn't say what he wants to use them for, or 
whether he wants to secure them against nation-states, or simply against me. ;)



                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
