On Apr 22, 2018, at 08:42, Nils Weskamp <[email protected]> wrote:
> Nice work. If brute-force approaches like this (or methods based on
> genetic algorithms etc.) are the only way to reverse a fingerprint, one
> could probably come up with a fingerprint that allows for pretty secure
> structure sharing by using many iterations of a strong cryptographic
> hash function that is really slow to calculate.
I usually think of "brute force" as something more like "enumerate all possible
structures, generate the fingerprint for each one, and compare." I think of
what I did here as a bit more elegant than that. ;)
I think of fingerprints as being applied to three uses:
1) identity test
if mol1 == mol2 then fp(mol1) == fp(mol2)
=> if fp(mol1) != fp(mol2) then mol1 != mol2
A good identity fingerprint has the properties that the false positive rate of
fp(mol1) == fp(mol2) but mol1 != mol2
is low, and that working with fingerprints gives advantages over a graph
isomorphism check.
2) Substructure screening
if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2).
(There are different definitions of 'subset' for different fingerprint
types.)
An effective substructure screen fingerprint has the properties that:
fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a
subgraph of mol2
-and-
fingerprint subset test is faster than subgraph isomorphism testing
3) Similarity comparison
Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity
of mol1 and mol2.
Usually we also assume that computing the similarity of two fingerprints is
fast.
A cheminformatics fingerprint usually supports #1 and one or both of #2 or #3.
If we only need #1, which is the use case Nils brought up, then we could use a
SHA256 of the canonical SMILES string, or use the InChI Key. These are
fixed-length binary fingerprints which can only be used for identity testing,
and would give a low false positive rate.
The structure leakage comes from needing support for #2 and/or #3.
I don't see any reasonable way to make a fingerprint that can do #2 and/or #3
without being open to some sort of enumeration scheme that is more clever than
brute force.
Possibly some scheme related to homomorphic encryption might work? As I
understand it, this would be unreasonable slow for what most people expect from
fingerprints.
Cheers,
Andrew
[email protected]
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss