On Apr 23, 2018, at 14:54, Brian Cole <[email protected]> wrote:
> Unfortunately it doesn't work on circular/ECFP-like fingerprints.
To be fair, you didn't mention that was a requirement. ;)
> It has the requirement that the fingerprint be a substructure fingerprint as
> you described.
Could you elaborate on your goal?
I used RDKitFingerprint because it was the easiest. It was something I could do
in a day to demonstrate that it is possible to reverse engineer some
fingerprints.
I think it's possible to do something similar for circular fingerprints. It
would mean generating all possible subgraphs of a given radius, which is doable
for r=2 or r=3, and probably r=4. RDKit has a way to look at the circular
environment around a a specific atom rather than the entire fingerprint, so
that can be used to generate a seed point. Once that's found, it can be
expanded to one of its neighbor atoms.
Another problem is that the Morgan fingerprint algorithm really wants sanitized
structures, which I didn't need to worry about for the hash fingerprints.
Instead of a day of work, it's going to take a couple of weeks of work, which
requires time and money.
My advice though is that it's surely possible to determine some structure
information from the circular fingerprint. If your use case says there should
be no information leak (other than what's possible by full
brute-force-enumeration) then don't exchange fingerprints.
But leaking information is not really what I thought of by "reverse engineer".
For example, if I want to check if any of the Morgan fingerprints with r=2
contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using
just the c(O)c as the fromAtoms. This gives:
% echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0
useBondTypes=1 fromAtoms=3,4,5,6
#software=RDKit/2016.09.3 chemfp/3.2.1
#date=2018-04-23T14:03:24
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000000000000100000000000000000000000000000000000000000000000000000000400000000000000000000000000000000004000000004000000000000000000000002000000000000000000000000000000000000000000
phenol
I can then screen using that fingerprint to see which fingerprint match.
Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of the are
detected by this screen, and there are no false positives.
Poof - structural information leakage.
The code is at the bottom of this email. It depends on the commercial version
of chemfp.
> It seems the evolutionary/genetic algorithm approach is the current
> state-of-the-art for decoding circular/ECFP-like fingerprints.
Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with
Daylight hash fingerprints. I don't think we know that GAs have ever been used
to reverse engineer circular fingerprints.
> Historical question for you since you're the closest we have to a
> chem-informatician historian. :-) Why did these circular/ECFP fingerprints
> come into existence?
I believe you are asking for https://pubs.acs.org/doi/abs/10.1021/ci100050t .
Extended-connectivity fingerprints (ECFPs) are a novel class of topological
fingerprints for molecular characterization. Historically, topological
fingerprints were developed for substructure and similarity searching.
ECFPs were developed specifically for structure−activity modeling.
> my reading of the current literature is that tree/dendritic are statistically
> just as good at virtual screening as circular/ECFP:
Yeah, I don't go there. I leave concepts like "just as good" or "better" to
people who have experimental data they can use for the comparison.
Andrew
[email protected]
== Code to find which Morgan fingerprints contain a phenol substructure ==
import chemfp
from chemfp import bitops, search
arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
print("Fingerprint type:", arena.metadata.type)
# Want to find structures containing phenol
# Adjust the fingerprint type to limit it to the given atoms
fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6")
query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")
print("Query fingerprint:")
print(bitops.hex_encode(query_fp))
print()
# Find the matching fingerprints
result = search.contains_fp(query_fp, arena)
circular_ids = set(result.get_ids())
# Search the first 100,000 structures
from rdkit import Chem
from chemfp import rdkit_toolkit as T
pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
all_ids = set()
exact_ids = set()
with T.read_molecules("/Users/dalke/databases/chembl_23.sdf.gz") as reader:
for mol in reader:
id = mol.GetProp("_Name")
all_ids.add(id)
if mol.HasSubstructMatch(pat):
exact_ids.add(id)
if len(all_ids) == 100000:
break
# limit the circular ids to only those checked
print("Full screen:", len(circular_ids))
circular_ids = circular_ids & all_ids
print("Relevant screen:", len(circular_ids))
print("#correct:", len(exact_ids & circular_ids))
print("#false positives:", len(circular_ids - exact_ids))
## I get the following:
# Fingerprint type: RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0
useChirality=0 useBondTypes=1
# Query fingerprint:
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000000000000000000020000000000008000000000000000000000000000000000000
#
0000000000000000000000000000000000000000000000000000000000000000000000000000000
#
0000000000000200000000000000100000000000000000000000000000000000000000000000000
#
0000004000000000000000000000000000000000040000000040000000000000000000000020000
# 00000000000000000000000000000000000000
# Full screen: 31134
# Relevant screen: 2216
# #correct: 2216
# #false positives: 0
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss