Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

Brice Hoffmann Mon, 23 Apr 2018 09:12:17 -0700

Hi,
Another option is to use generative models that uses fingerprints as input
(ex: https://arxiv.org/abs/1701.01329,
https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.7b00346). If you use as
a scoring function of the generated molecules the Tanimoto Distance to a
given fingerprint, you can often retrieve the original compound.
At Iktos we develop such methods and it work pretty well !
Best regards,
Brice




2018-04-23 16:18 GMT+02:00 Andrew Dalke <[email protected]>:

> On Apr 23, 2018, at 14:54, Brian Cole <[email protected]> wrote:
> > Unfortunately it doesn't work on circular/ECFP-like fingerprints.
>
> To be fair, you didn't mention that was a requirement. ;)
>
> > It has the requirement that the fingerprint be a substructure
> fingerprint as you described.
>
> Could you elaborate on your goal?
>
> I used RDKitFingerprint because it was the easiest. It was something I
> could do in a day to demonstrate that it is possible to reverse engineer
> some fingerprints.
>
> I think it's possible to do something similar for circular fingerprints.
> It would mean generating all possible subgraphs of a given radius, which is
> doable for r=2 or r=3, and probably r=4. RDKit has a way to look at the
> circular environment around a a specific atom rather than the entire
> fingerprint, so that can be used to generate a seed point. Once that's
> found, it can be expanded to one of its neighbor atoms.
>
> Another problem is that the Morgan fingerprint algorithm really wants
> sanitized structures, which I didn't need to worry about for the hash
> fingerprints.
>
> Instead of a day of work, it's going to take a couple of weeks of work,
> which requires time and money.
>
> My advice though is that it's surely possible to determine some structure
> information from the circular fingerprint. If your use case says there
> should be no information leak (other than what's possible by full
> brute-force-enumeration) then don't exchange fingerprints.
>
> But leaking information is not really what I thought of by "reverse
> engineer".
>
> For example, if I want to check if any of the Morgan fingerprints with r=2
> contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using
> just the c(O)c as the fromAtoms. This gives:
>
>
> % echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
> #FPS1
> #num_bits=2048
> #type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0
> useBondTypes=1 fromAtoms=3,4,5,6
> #software=RDKit/2016.09.3 chemfp/3.2.1
> #date=2018-04-23T14:03:24
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000020000000000008000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000002000000000000001000000000000000
> 000000000000000000000000000000000000000004000000000000000000
> 000000000000000040000000040000000000000000000000020000000000
> 00000000000000000000000000000000        phenol
>
>
> I can then screen using that fingerprint to see which fingerprint match.
>
> Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of the
> are detected by this screen, and there are no false positives.
>
> Poof - structural information leakage.
>
> The code is at the bottom of this email. It depends on the commercial
> version of chemfp.
>
>
> > It seems the evolutionary/genetic algorithm approach is the current
> state-of-the-art for decoding circular/ECFP-like fingerprints.
>
> Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with
> Daylight hash fingerprints. I don't think we know that GAs have ever been
> used to reverse engineer circular fingerprints.
>
>
> > Historical question for you since you're the closest we have to a
> chem-informatician historian. :-) Why did these circular/ECFP fingerprints
> come into existence?
>
> I believe you are asking for https://pubs.acs.org/doi/abs/
> 10.1021/ci100050t .
>
>   Extended-connectivity fingerprints (ECFPs) are a novel class of
> topological
>   fingerprints for molecular characterization. Historically, topological
>   fingerprints were developed for substructure and similarity searching.
>   ECFPs were developed specifically for structure−activity modeling.
>
> > my reading of the current literature is that tree/dendritic are
> statistically just as good at virtual screening as circular/ECFP:
>
> Yeah, I don't go there. I leave concepts like "just as good" or "better"
> to people who have experimental data they can use for the comparison.
>
>
>                                 Andrew
>                                 [email protected]
>
> == Code to find which Morgan fingerprints contain a phenol substructure ==
>
> import chemfp
> from chemfp import bitops, search
>
> arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
> print("Fingerprint type:", arena.metadata.type)
>
> # Want to find structures containing phenol
>
> # Adjust the fingerprint type to limit it to the given atoms
> fptype = chemfp.get_fingerprint_type(arena.metadata.type + "
> fromAtoms=3,4,5,6")
> query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")
>
> print("Query fingerprint:")
> print(bitops.hex_encode(query_fp))
> print()
>
> # Find the matching fingerprints
> result = search.contains_fp(query_fp, arena)
>
> circular_ids = set(result.get_ids())
>
> # Search the first 100,000 structures
> from rdkit import Chem
> from chemfp import rdkit_toolkit as T
>
> pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
> all_ids = set()
> exact_ids = set()
> with T.read_molecules("/Users/dalke/databases/chembl_23.sdf.gz") as
> reader:
>     for mol in reader:
>         id = mol.GetProp("_Name")
>         all_ids.add(id)
>         if mol.HasSubstructMatch(pat):
>             exact_ids.add(id)
>         if len(all_ids) == 100000:
>             break
>
> # limit the circular ids to only those checked
> print("Full screen:", len(circular_ids))
> circular_ids = circular_ids & all_ids
> print("Relevant screen:", len(circular_ids))
>
> print("#correct:", len(exact_ids & circular_ids))
> print("#false positives:", len(circular_ids - exact_ids))
>
> ## I get the following:
> # Fingerprint type: RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0
> useChirality=0 useBondTypes=1
> # Query fingerprint:
> # 000000000000000000000000000000000000000000000000000000000000
> 0000000000000000000
> # 000000000000000000000000000000000000000000000000000000000000
> 0000000000000000000
> # 000000000000000000000000000002000000000000800000000000000000
> 0000000000000000000
> # 000000000000000000000000000000000000000000000000000000000000
> 0000000000000000000
> # 000000000000020000000000000010000000000000000000000000000000
> 0000000000000000000
> # 000000400000000000000000000000000000000004000000004000000000
> 0000000000000020000
> # 00000000000000000000000000000000000000
> # Full screen: 31134
> # Relevant screen: 2216
> # #correct: 2216
> # #false positives: 0
>
>
>
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>



-- 
Brice HOFFMANN
Senior Scientist,
Molecular Modeling & Computational Chemistry
iktos.ai
24 rue chaptal 75009 Paris

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

Reply via email to