Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 20:22, Nils Weskamp  wrote:
> Actually, I *was* also thinking about your use cases 2 and 3 since you
> also need some form of hash function to map substructures to bit
> numbers. This is normally a rather simple function / pseudo random
> generator, 

Strictly speaking, this is not a requirement.

The term "fingerprint" has taken on quite an encompassing meaning since 1990.

The molecular formula is a count fingerprint with 118 keys, based on the atomic 
number. There's no need for hash function there. "CCO" might be:
  [0, 0, 0, 0, 0, 2, 0, 1, ...]

Or it can be written in more compact form like {"C": 2, "O": 1}.

As an alternative, I could use a mapping from canonical substructures to 
counts, so "CCO" becomes:

  {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1}

This doesn't require a hash. (While I represent that as a Python dictionary, 
which uses a hash table underneath, it could be implemented using a red-black 
tree or B-tree, or with a simple linear search.)

It's only if I want to convert this into fixed length representation where I 
have to figure out some sort of encoding scheme.

Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I 
could have a table which maps all canonical substructures to its bit pattern. 
If I have an unknown fragment, I could use RANDOM.ORG to get the bits.

Downsides include potentially unbounded table growth and the need for a 
centralized table.

This is the approach that Zatocoding used, and I see Chemical Zatocoding as the 
only precursor to Daylight hash fingerprints.

>  which could of course also be changed to something expensive to calculate.


Yes, that could be possible. Abstractly, let the first 20 bytes of each 
fingerprint be a salt, and use something like bcrypt so each fingerprint test 
requires that the query structure be re-fingerprinted for the per-fingerprint 
hash function.

It would, however, take an absurdly long time to do a similarity search.

And in any case, before going further along that path, we would need to figure 
out the risk model. Brian started by saying that he wanted to obfuscate 
molecules for security, but didn't say what he want to use them for, and if he 
want to secure them against nation-states, or simply against me. ;)



Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Nils Weskamp
Hi Andrew,

Am 22.04.2018 um 19:35 schrieb Andrew Dalke:
> I think of what I did here as a bit more elegant than that. ;)

I should have have looked at the code more carefully before commenting.
;) Nevertheless, you will probably still need many steps for complex
structures - although not as many as I anticipated.

Actually, I *was* also thinking about your use cases 2 and 3 since you
also need some form of hash function to map substructures to bit
numbers. This is normally a rather simple function / pseudo random
generator, which could of course also be changed to something expensive
to calculate.

On a second thought, one would of course have to make sure that this
function cannot be pre-calculated for some lookup table, which might be
challenging.

Best regards,
Nils

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 08:42, Nils Weskamp  wrote:
> Nice work. If brute-force approaches like this (or methods based on
> genetic algorithms etc.) are the only way to reverse a fingerprint, one
> could probably come up with a fingerprint that allows for pretty secure
> structure sharing by using many iterations of a strong cryptographic
> hash function that is really slow to calculate.

I usually think of "brute force" as something more like "enumerate all possible 
structures, generate the fingerprint for each one, and compare." I think of 
what I did here as a bit more elegant than that. ;)

I think of fingerprints as being applied to three uses:

1) identity test
if mol1 == mol2 then fp(mol1) == fp(mol2)
=> if fp(mol1) != fp(mol2) then mol1 != mol2

A good identity fingerprint has the properties that the false positive rate of
 fp(mol1) == fp(mol2) but mol1 != mol2
is low, and that working with fingerprints gives advantages over a graph 
isomorphism check.

2) Substructure screening
  if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2).
(There are different definitions of 'subset' for different fingerprint 
types.)

  An effective substructure screen fingerprint has the properties that:
  fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a 
subgraph of mol2
   -and-
  fingerprint subset test is faster than subgraph isomorphism testing

3) Similarity comparison
  Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity 
of mol1 and mol2.

Usually we also assume that computing the similarity of two fingerprints is 
fast.


A cheminformatics fingerprint usually supports #1 and one or both of #2 or #3. 

If we only need #1, which is the use case Nils brought up, then we could use a 
SHA256 of the canonical SMILES string, or use the InChI Key. These are 
fixed-length binary fingerprints which can only be used for identity testing, 
and would give a low false positive rate.

The structure leakage comes from needing support for #2 and/or #3.

I don't see any reasonable way to make a fingerprint that can do #2 and/or #3 
without being open to some sort of enumeration scheme that is more clever than 
brute force.

Possibly some scheme related to homomorphic encryption might work? As I 
understand it, this would be unreasonable slow for what most people expect from 
fingerprints.

Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Nils Weskamp
Am 22.04.2018 um 03:04 schrieb Andrew Dalke:
> Here's an implementation of that sketch, applied to the RDKit hash 
> fingerprint:

Nice work. If brute-force approaches like this (or methods based on
genetic algorithms etc.) are the only way to reverse a fingerprint, one
could probably come up with a fingerprint that allows for pretty secure
structure sharing by using many iterations of a strong cryptographic
hash function that is really slow to calculate.

Nils

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss