Hi All,
I remember reading Dave’s patent when I was writing my CoG program around 2002.
I had to write my own fingerprinting algorithm, based on Daylight fingerprints,
called Fingal, which I could use to decode the fingerprints back into the
structure(s) represented by it. I also had to convince Johnny Gasteiger that
molecular fingerprints were useful.
https://pubs.acs.org/doi/abs/10.1021/ci034290p
I found it only worked on count fingerprints, since the binary one lost too
much information. Never tried it on ECFP-like fingerprints. Still have the code
somewhere…
Cheers, Nath
From: David Cosgrove <[email protected]>
Date: Monday, 23 April 2018 at 17:28
To: Brice Hoffmann <[email protected]>
Cc: RDKit Discuss <[email protected]>
Subject: Re: [Rdkit-discuss] Any known papers on reverse engineering
fingerprints into structures?
Hi all,
I’ve just had the attached from Roger Sayle, which might be of interest.
Dave
Hi Andrew and Dave,
John (Mayfield) has just pointed me at the very interesting discussion
raging on sourceforge.
Alas, I've no idea how to post/tweet/snapchat a reply, but thought I'd at
least contribute this
small nugget of trivia: US5434796A
https://patents.google.com/patent/US5434796A
where Dave W. managed to patent any application of genetic algorithms to
optimizing an
objective function of a molecular structure [which is optimistically broad,
and in its day may
even have covered Andrew's rev_eng_fp.py, which is very impressive by the
way.] This is
probably the closest thing to a publication around Dave's work with GA's
that was mentioned
on the thread (I think he only authored four or five papers in his
lifetime).
Best wishes to the both of you. Keep up the great discussion.
Cheers,
Roger
--
Roger Sayle, PhD.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305<tel:07588305>
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park,
Cambridge, CB4 0EY
On Mon, 23 Apr 2018 at 17:11, Brice Hoffmann
<[email protected]<mailto:[email protected]>> wrote:
Hi,
Another option is to use generative models that uses fingerprints as input (ex:
https://arxiv.org/abs/1701.01329,
https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.7b00346). If you use as a
scoring function of the generated molecules the Tanimoto Distance to a given
fingerprint, you can often retrieve the original compound.
At Iktos we develop such methods and it work pretty well !
Best regards,
Brice
2018-04-23 16:18 GMT+02:00 Andrew Dalke
<[email protected]<mailto:[email protected]>>:
On Apr 23, 2018, at 14:54, Brian Cole
<[email protected]<mailto:[email protected]>> wrote:
> Unfortunately it doesn't work on circular/ECFP-like fingerprints.
To be fair, you didn't mention that was a requirement. ;)
> It has the requirement that the fingerprint be a substructure fingerprint as
> you described.
Could you elaborate on your goal?
I used RDKitFingerprint because it was the easiest. It was something I could do
in a day to demonstrate that it is possible to reverse engineer some
fingerprints.
I think it's possible to do something similar for circular fingerprints. It
would mean generating all possible subgraphs of a given radius, which is doable
for r=2 or r=3, and probably r=4. RDKit has a way to look at the circular
environment around a a specific atom rather than the entire fingerprint, so
that can be used to generate a seed point. Once that's found, it can be
expanded to one of its neighbor atoms.
Another problem is that the Morgan fingerprint algorithm really wants sanitized
structures, which I didn't need to worry about for the hash fingerprints.
Instead of a day of work, it's going to take a couple of weeks of work, which
requires time and money.
My advice though is that it's surely possible to determine some structure
information from the circular fingerprint. If your use case says there should
be no information leak (other than what's possible by full
brute-force-enumeration) then don't exchange fingerprints.
But leaking information is not really what I thought of by "reverse engineer".
For example, if I want to check if any of the Morgan fingerprints with r=2
contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using
just the c(O)c as the fromAtoms. This gives:
% echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0
useBondTypes=1 fromAtoms=3,4,5,6
#software=RDKit/2016.09.3 chemfp/3.2.1
#date=2018-04-23T14:03:24
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000000000000100000000000000000000000000000000000000000000000000000000400000000000000000000000000000000004000000004000000000000000000000002000000000000000000000000000000000000000000
phenol
I can then screen using that fingerprint to see which fingerprint match.
Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of the are
detected by this screen, and there are no false positives.
Poof - structural information leakage.
The code is at the bottom of this email. It depends on the commercial version
of chemfp.
> It seems the evolutionary/genetic algorithm approach is the current
> state-of-the-art for decoding circular/ECFP-like fingerprints.
Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with
Daylight hash fingerprints. I don't think we know that GAs have ever been used
to reverse engineer circular fingerprints.
> Historical question for you since you're the closest we have to a
> chem-informatician historian. :-) Why did these circular/ECFP fingerprints
> come into existence?
I believe you are asking for https://pubs.acs.org/doi/abs/10.1021/ci100050t .
Extended-connectivity fingerprints (ECFPs) are a novel class of topological
fingerprints for molecular characterization. Historically, topological
fingerprints were developed for substructure and similarity searching.
ECFPs were developed specifically for structure−activity modeling.
> my reading of the current literature is that tree/dendritic are statistically
> just as good at virtual screening as circular/ECFP:
Yeah, I don't go there. I leave concepts like "just as good" or "better" to
people who have experimental data they can use for the comparison.
Andrew
[email protected]<mailto:[email protected]>
== Code to find which Morgan fingerprints contain a phenol substructure ==
import chemfp
from chemfp import bitops, search
arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
print("Fingerprint type:", arena.metadata.type)
# Want to find structures containing phenol
# Adjust the fingerprint type to limit it to the given atoms
fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6")
query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")
print("Query fingerprint:")
print(bitops.hex_encode(query_fp))
print()
# Find the matching fingerprints
result = search.contains_fp(query_fp, arena)
circular_ids = set(result.get_ids())
# Search the first 100,000 structures
from rdkit import Chem
from chemfp import rdkit_toolkit as T
pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
all_ids = set()
exact_ids = set()
with T.read_molecules("/Users/dalke/databases/chembl_23.sdf.gz") as reader:
for mol in reader:
id = mol.GetProp("_Name")
all_ids.add(id)
if mol.HasSubstructMatch(pat):
exact_ids.add(id)
if len(all_ids) == 100000:
break
# limit the circular ids to only those checked
print("Full screen:", len(circular_ids))
circular_ids = circular_ids & all_ids
print("Relevant screen:", len(circular_ids))
print("#correct:", len(exact_ids & circular_ids))
print("#false positives:", len(circular_ids - exact_ids))
## I get the following:
# Fingerprint type: RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0
useChirality=0 useBondTypes=1
# Query fingerprint:
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000000000000000000020000000000008000000000000000000000000000000000000
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000200000000000000100000000000000000000000000000000000000000000000000
#
0000004000000000000000000000000000000000040000000040000000000000000000000020000
# 00000000000000000000000000000000000000
# Full screen: 31134
# Relevant screen: 2216
# #correct: 2216
# #false positives: 0
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
--
Brice HOFFMANN
Senior Scientist,
Molecular Modeling & Computational Chemistry
iktos.ai<http://iktos.ai/>
24 rue chaptal 75009 Paris
[cid:[email protected]]
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org!
http://sdm.link/slashdot_______________________________________________
Rdkit-discuss mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss