Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 22, 2018, at 20:22, Nils Weskampwrote: > Actually, I *was* also thinking about your use cases 2 and 3 since you > also need some form of hash function to map substructures to bit > numbers. This is normally a rather simple function / pseudo random > generator, Strictly speaking, this is not a requirement. The term "fingerprint" has taken on quite an encompassing meaning since 1990. The molecular formula is a count fingerprint with 118 keys, based on the atomic number. There's no need for hash function there. "CCO" might be: [0, 0, 0, 0, 0, 2, 0, 1, ...] Or it can be written in more compact form like {"C": 2, "O": 1}. As an alternative, I could use a mapping from canonical substructures to counts, so "CCO" becomes: {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1} This doesn't require a hash. (While I represent that as a Python dictionary, which uses a hash table underneath, it could be implemented using a red-black tree or B-tree, or with a simple linear search.) It's only if I want to convert this into fixed length representation where I have to figure out some sort of encoding scheme. Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I could have a table which maps all canonical substructures to its bit pattern. If I have an unknown fragment, I could use RANDOM.ORG to get the bits. Downsides include potentially unbounded table growth and the need for a centralized table. This is the approach that Zatocoding used, and I see Chemical Zatocoding as the only precursor to Daylight hash fingerprints. > which could of course also be changed to something expensive to calculate. Yes, that could be possible. Abstractly, let the first 20 bytes of each fingerprint be a salt, and use something like bcrypt so each fingerprint test requires that the query structure be re-fingerprinted for the per-fingerprint hash function. It would, however, take an absurdly long time to do a similarity search. And in any case, before going further along that path, we would need to figure out the risk model. Brian started by saying that he wanted to obfuscate molecules for security, but didn't say what he want to use them for, and if he want to secure them against nation-states, or simply against me. ;) Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
Hi Andrew, Am 22.04.2018 um 19:35 schrieb Andrew Dalke: > I think of what I did here as a bit more elegant than that. ;) I should have have looked at the code more carefully before commenting. ;) Nevertheless, you will probably still need many steps for complex structures - although not as many as I anticipated. Actually, I *was* also thinking about your use cases 2 and 3 since you also need some form of hash function to map substructures to bit numbers. This is normally a rather simple function / pseudo random generator, which could of course also be changed to something expensive to calculate. On a second thought, one would of course have to make sure that this function cannot be pre-calculated for some lookup table, which might be challenging. Best regards, Nils -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 22, 2018, at 08:42, Nils Weskampwrote: > Nice work. If brute-force approaches like this (or methods based on > genetic algorithms etc.) are the only way to reverse a fingerprint, one > could probably come up with a fingerprint that allows for pretty secure > structure sharing by using many iterations of a strong cryptographic > hash function that is really slow to calculate. I usually think of "brute force" as something more like "enumerate all possible structures, generate the fingerprint for each one, and compare." I think of what I did here as a bit more elegant than that. ;) I think of fingerprints as being applied to three uses: 1) identity test if mol1 == mol2 then fp(mol1) == fp(mol2) => if fp(mol1) != fp(mol2) then mol1 != mol2 A good identity fingerprint has the properties that the false positive rate of fp(mol1) == fp(mol2) but mol1 != mol2 is low, and that working with fingerprints gives advantages over a graph isomorphism check. 2) Substructure screening if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2). (There are different definitions of 'subset' for different fingerprint types.) An effective substructure screen fingerprint has the properties that: fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a subgraph of mol2 -and- fingerprint subset test is faster than subgraph isomorphism testing 3) Similarity comparison Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity of mol1 and mol2. Usually we also assume that computing the similarity of two fingerprints is fast. A cheminformatics fingerprint usually supports #1 and one or both of #2 or #3. If we only need #1, which is the use case Nils brought up, then we could use a SHA256 of the canonical SMILES string, or use the InChI Key. These are fixed-length binary fingerprints which can only be used for identity testing, and would give a low false positive rate. The structure leakage comes from needing support for #2 and/or #3. I don't see any reasonable way to make a fingerprint that can do #2 and/or #3 without being open to some sort of enumeration scheme that is more clever than brute force. Possibly some scheme related to homomorphic encryption might work? As I understand it, this would be unreasonable slow for what most people expect from fingerprints. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
Am 22.04.2018 um 03:04 schrieb Andrew Dalke: > Here's an implementation of that sketch, applied to the RDKit hash > fingerprint: Nice work. If brute-force approaches like this (or methods based on genetic algorithms etc.) are the only way to reverse a fingerprint, one could probably come up with a fingerprint that allows for pretty secure structure sharing by using many iterations of a strong cryptographic hash function that is really slow to calculate. Nils -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss